
[k8s] better support for GKE scale-to-zero autoscaling node pools #4935


Merged
merged 17 commits into master from k8s-gke-autoscaler on Mar 17, 2025

Conversation

SeungjinYang
Collaborator

@SeungjinYang SeungjinYang commented Mar 12, 2025

Addresses #4875

Currently, if a node autoscaler is configured on a k8s cluster, the only thing skypilot knows about the autoscaler is the configuration provided by the user. In particular, skypilot has no way to tell whether a node pool that has simply been autoscaled to zero could provide the node type needed to handle a job. Currently, skypilot gets around this by submitting a pod to each context with the autoscaler enabled and checking whether the pod is scheduled before a timeout.

While this approach is functional, it is inefficient because:

  • A context (= cluster) may have an autoscaling node pool, but that node pool may not provide the VM needed to satisfy the request. For example, there may be an autoscaler on a node pool with A100 GPU VMs - skypilot doesn't know this, only that there is an autoscaler group, and will try to launch H100 resources on it.
  • The autoscaling node pool may have the correct accelerator type but a different accelerator count, or different CPU/memory constraints. For example, a node pool that spins up VMs with 1 A100 cannot handle launch requests for A100:8, but again skypilot doesn't know that.
  • If there are multiple allowed contexts and only some of them have autoscalers, there is no way for skypilot to know which. So skypilot may try to schedule a pod on a context without an autoscaler that cannot schedule that pod.
    ^ note on the above: the k8s autoscaler configuration is global, not per-context. A per-context autoscaler config could also solve this specific bullet point.

This PR solves these challenges for the GKE autoscaler specifically. It does so by querying each context for its node pools, detecting whether any node pool has autoscaling configured, and checking whether a node that satisfies the job request can be spun up.

Assumptions in code:

  • Context name follows the convention gke_PROJECT-ID_ZONE_CLUSTER-NAME. If not, skypilot falls back to the legacy codepath.
  • The customer has GCP auth set up for skypilot to query GKE cluster details. If not, skypilot falls back to the legacy codepath.
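Under these assumptions, the overall per-context check can be sketched roughly as follows (a hedged sketch, not SkyPilot's actual internals; `fetch_gke_cluster` and `pool_can_fit` are hypothetical stand-ins passed in by the caller):

```python
# Rough sketch of the flow added in this PR: parse the context name, look
# the cluster up in GCP, and see whether any autoscaling-enabled node pool
# could fit the requested instance type. Any failure falls back to the
# legacy optimistic behavior (return True and let the pod-submit probe
# decide). Helper names are illustrative.
def context_may_autoscale_for(context: str, instance_type: str,
                              fetch_gke_cluster, pool_can_fit) -> bool:
    parts = context.split('_')
    if len(parts) != 4 or parts[0] != 'gke':
        return True  # not the standard GKE context format: legacy codepath
    project_id, location, cluster_name = parts[1:]
    try:
        cluster = fetch_gke_cluster(project_id, location, cluster_name)
    except Exception:
        return True  # no GCP auth / lookup failed: legacy codepath
    return any(
        (pool.get('autoscaling') or {}).get('enabled', False) and
        pool_can_fit(instance_type, pool)
        for pool in cluster.get('nodePools', []))
```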

Testing

  • Known failure case: skypilot[gcp] is not installed
% sky launch --gpus tpu-v5-lite-device:1 --cpus 32 --cloud kubernetes
Could not fetch autoscaler information from GKE. Run pip install "skypilot[gcp]" for more intelligent pod scheduling with GKE autoscaler.
Considered resources (1 node):
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                             vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE                                           COST ($)   CHOSEN   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   32CPU--128GB--tpu-v5-lite-device:1   32      128       tpu-v5-lite-device:1   gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔     
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-56a0-seungjinyang'. Proceed? [Y/n]:
...
% sky launch --gpus a100:1 --cloud kubernetes                                                               
Could not fetch autoscaler information from GKE. Run pip install "skypilot[gcp]" for more intelligent pod scheduling with GKE autoscaler.
Considered resources (1 node):
---------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                           COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--A100:1   2       8         A100:1         gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔     
---------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-e17b-seungjinyang'. Proceed? [Y/n]: 

As seen, skypilot still attempts to schedule pods requesting GPUs that are not in the autoscaled node pool; this is consistent with legacy code behavior.

  • Known failure case: context name does not follow the standard GKE context format.
    Tested on a GKE cluster with a node pool of ct5l-hightpu-1t (exposing 1xtpu-v5-lite-device per node).
% kubectx test=.
Context "gke_<project>_us-central1-a_skypilot-test-cluster" renamed to "test".
% sky launch --gpus tpu-v5-lite-device:1 --cpus 32 --cloud kubernetes
Considered resources (1 node):
------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                             vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE   COST ($)   CHOSEN   
------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   32CPU--128GB--tpu-v5-lite-device:1   32      128       tpu-v5-lite-device:1   test          0.00          ✔     
------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-becc-seungjinyang'. Proceed? [Y/n]: 
...
% sky launch --gpus a100:1 --cloud kubernetes              
Considered resources (1 node):
-----------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
-----------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--A100:1   2       8         A100:1         test          0.00          ✔     
-----------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-0301-seungjinyang'. Proceed? [Y/n]: 

As seen, skypilot still attempts to schedule pods requesting GPUs that are not in the autoscaled node pool; this is consistent with legacy code behavior.

  • Test newly introduced functionality
    Tested on a GKE cluster with a node pool of ct5l-hightpu-1t (exposing 1xtpu-v5-lite-device per node).
% sky launch --gpus tpu-v5-lite-device:4 --cloud kubernetes
No resource satisfying Kubernetes({'tpu-v5-lite-device': 4}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-device': 4}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
% sky launch --gpus a100:1 --cloud kubernetes
No resource satisfying Kubernetes({'A100': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 1}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
% sky launch --gpus tpu-v5-lite-device:1 --cloud kubernetes
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                          vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE                                           COST ($)   CHOSEN   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--tpu-v5-lite-device:1   2       8         tpu-v5-lite-device:1   <project>_us-central1-a_skypilot-test-cluster   0.00          ✔     
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-169e-seungjinyang'. Proceed? [Y/n]:

As seen, skypilot only proceeds to schedule a pod on the cluster if the correct accelerator is present.

Caveat
Context: the A100 node deployed on the cluster has 12 vCPUs and 85 GB of memory.

% sky launch --gpus a100:1 --cpus 12 --memory 85  --cloud kubernetes
Considered resources (1 node):
-----------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE              vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                           COST ($)   CHOSEN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   12CPU--85GB--A100:1   12      85        A100:1         gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔ 
-----------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-6d90-seungjinyang'. Proceed? [Y/n]:
Aborted!
% sky launch --gpus a100:1 --cpus 13 --memory 85 --cloud kubernetes
No resource satisfying Kubernetes(cpus=13, mem=85, {'A100': 1}) on Kubernetes.
Try specifying a different CPU count, or add "+" to the end of the CPU count to allow for larger instances.
Try specifying a different memory size, or add "+" to the end of the memory size to allow for larger instances.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes(cpus=13, mem=85, {'A100': 1}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
% sky launch --gpus a100:1 --cpus 12 --memory 86 --cloud kubernetes
No resource satisfying Kubernetes(cpus=12, mem=86, {'A100': 1}) on Kubernetes.
Try specifying a different CPU count, or add "+" to the end of the CPU count to allow for larger instances.
Try specifying a different memory size, or add "+" to the end of the memory size to allow for larger instances.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes(cpus=12, mem=86, {'A100': 1}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

As seen here, skypilot blocks requests on an autoscaling node pool with GPUs when the requested CPU or memory exceeds what the node provides.

However, this is not true for requests on autoscaling node pools with TPUs.
Context: the vt TPU node deployed on the cluster has 24 vCPUs and 48 GB of memory.

% sky launch --gpus tpu-v5-lite-device:1 --cpus 24 --memory 48 --cloud kubernetes
Considered resources (1 node):
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                            vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE                                           COST ($)   CHOSEN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   24CPU--48GB--tpu-v5-lite-device:1   24      48        tpu-v5-lite-device:1   gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-b28a-seungjinyang'. Proceed? [Y/n]: n
Aborted!
% sky launch --gpus tpu-v5-lite-device:1 --cpus 999 --memory 999 --cloud kubernetes
Considered resources (1 node):
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                              vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE                                           COST ($)   CHOSEN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   999CPU--999GB--tpu-v5-lite-device:1   999     999       tpu-v5-lite-device:1   gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-5732-seungjinyang'. Proceed? [Y/n]: n

Requests that specify more CPUs/memory than the TPU node provides still go through.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (see above)
  • Relevant individual tests: /smoke-test --kubernetes
  • Backward compatibility: /quicktest-core

Collaborator Author

@SeungjinYang SeungjinYang left a comment


Moving review comments to new locations after the last commit

@SeungjinYang SeungjinYang force-pushed the k8s-gke-autoscaler branch 3 times, most recently from 40839da to c702207 Compare March 12, 2025 21:56
@SeungjinYang SeungjinYang force-pushed the k8s-gke-autoscaler branch 2 times, most recently from 9505cf6 to 41001c4 Compare March 13, 2025 00:18
@SeungjinYang SeungjinYang force-pushed the k8s-gke-autoscaler branch 3 times, most recently from 6151f03 to 0d999db Compare March 14, 2025 06:38
@SeungjinYang SeungjinYang self-assigned this Mar 14, 2025
@SeungjinYang SeungjinYang requested a review from cg505 March 14, 2025 18:40
@SeungjinYang SeungjinYang marked this pull request as ready for review March 14, 2025 19:40
@SeungjinYang SeungjinYang force-pushed the k8s-gke-autoscaler branch 3 times, most recently from e917c66 to d7de1aa Compare March 14, 2025 19:55
@SeungjinYang
Collaborator Author

/quicktest-core

Collaborator

@cg505 cg505 left a comment


Since there are a lot of conditionals/logic paths, I think more debugging logs (ideally at almost every branch) would be super helpful in the future when we have to debug some customer issues with this.

@SeungjinYang
Collaborator Author

I've added a good bit of debug logs based on my experience developing this PR.

@SeungjinYang
Collaborator Author

quicktest-core and smoke-test --kubernetes are passing as of this commit

Collaborator

@cg505 cg505 left a comment


Looks good now! Left two minor comments you can address before merging.

@SeungjinYang SeungjinYang enabled auto-merge (squash) March 17, 2025 17:39
@SeungjinYang SeungjinYang merged commit 9e4f222 into master Mar 17, 2025
18 checks passed
@SeungjinYang SeungjinYang deleted the k8s-gke-autoscaler branch March 17, 2025 17:46
Collaborator

@Michaelvll Michaelvll left a comment


Thanks @SeungjinYang and @cg505 for getting this in! I am leaving some comments for some minor issues. : )


# This variable is stored in memory in the server.
# The variable will reset if the server restarts.
_pip_install_gcp_hint_last_sent = 0.0
Collaborator


This won't persist across multiple invocations of SkyPilot. Wondering what the purpose of this is?

Collaborator Author


This works empirically (i.e., the timeout is respected across multiple sky invocations), except for the first invocation after `sky api stop; sky api start`. This is because GKEAutoscaler is instantiated on the server side, and this variable is not reset unless the server is reset.
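The behavior discussed above can be sketched as a module-level rate limiter living in the server process (a sketch only; the names `_HINT_INTERVAL_SECONDS` and `maybe_emit_gcp_hint` are illustrative, not SkyPilot's):

```python
import time

# Because the timestamp is module-level state in the long-lived server
# process, the hint fires at most once per interval across client
# invocations, and the limiter resets only when the server restarts.
_HINT_INTERVAL_SECONDS = 60 * 10
_hint_last_sent = 0.0

def maybe_emit_gcp_hint(log=print) -> bool:
    """Emit the hint if the interval has elapsed; return whether it fired."""
    global _hint_last_sent
    now = time.time()
    if now - _hint_last_sent < _HINT_INTERVAL_SECONDS:
        return False
    _hint_last_sent = now
    log('Run pip install "skypilot[gcp]" for more intelligent '
        'pod scheduling with GKE autoscaler.')
    return True
```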

Comment on lines +658 to +670
for node_pool in cluster['nodePools']:
    logger.debug(f'checking if node pool {node_pool["name"]} '
                 'has autoscaling enabled.')
    if (node_pool['autoscaling'] is not None and
            'enabled' in node_pool['autoscaling'] and
            node_pool['autoscaling']['enabled']):
        logger.debug(
            f'node pool {node_pool["name"]} has autoscaling enabled. '
            'Checking if it can create a node '
            f'satisfying {instance_type}')
        if cls._check_instance_fits_gke_autoscaler_node_pool(
                instance_type, node_pool):
            return True
Collaborator


Just curious: is there any chance we will get a KeyError from these calls? Since the autoscaler check is an optional optimization, we may want to make sure it does not fail in any unexpected cases.

Collaborator Author


This should never raise a KeyError, but I know better than to rely on "should". I'll execute this code block inside a try/except on KeyError and fall back to scheduling pods optimistically if one is raised.
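The defensive pattern described above might look like this (a minimal sketch under the assumption that the fit-check is best-effort; `can_create_node`, `node_pools`, and `check_fits` are hypothetical stand-ins):

```python
# Treat the autoscaler fit-check as an optional optimization: if the GKE
# API response is missing an expected key, fall back to optimistic
# scheduling rather than failing the whole launch.
def can_create_node(node_pools, instance_type, check_fits) -> bool:
    try:
        for node_pool in node_pools:
            autoscaling = node_pool['autoscaling']  # may raise KeyError
            if autoscaling is not None and autoscaling.get('enabled'):
                if check_fits(instance_type, node_pool):
                    return True
        return False
    except KeyError:
        # Unexpected response shape: schedule optimistically.
        return True
```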

Comment on lines +674 to +692
def _validate_context_name(cls, context: str) -> Tuple[bool, str, str, str]:
    """Validates the context name is in the format of
    gke_PROJECT-ID_LOCATION_CLUSTER-NAME
    Returns:
        bool: True if the context name is in the format of
            gke_PROJECT-ID_LOCATION_CLUSTER-NAME
        str: project id
        str: location
        str: cluster name
    """
    context_components = context.split('_')
    if len(context_components) != 4 or context_components[0] != 'gke':
        logger.debug(
            f'context {context} is not in valid GKE context format.')
        return False, '', '', ''

    logger.debug(f'context {context} is in valid GKE context format.')
    return True, context_components[1], context_components[
        2], context_components[3]
Collaborator


Checking the context name is not a very robust solution, since a user can always change the context name manually. Is there a better way to check if a cluster is on GKE?

Collaborator


This also relates to the robustness comment above. If a context name is set to start with gke_, will it cause the KeyError issue above?

Collaborator Author

@SeungjinYang SeungjinYang Mar 19, 2025


We use the parsed context information to query the GCP backend for the cluster info. If the context name passes the parser but isn't actually a valid GKE context, the cluster lookup against GCP will fail. In that case we fall back to optimistically scheduling pods.

to fit the instance type.
"""
for accelerator in node_pool_accelerators:
node_accelerator_type = GKELabelFormatter. \
Collaborator


minor: please avoid using `\` for line continuation; use parentheses instead
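For illustration, the reviewer's suggestion amounts to the following (a generic sketch; the class body here is a dummy stand-in, not the real GKELabelFormatter):

```python
# Dummy stand-in, just to make the example self-contained.
class GKELabelFormatter:
    @staticmethod
    def canonicalize(value: str) -> str:
        return value.lower()

# Discouraged: trailing-backslash continuation.
# node_accelerator_type = GKELabelFormatter. \
#     canonicalize('A100')

# Preferred: implicit continuation inside parentheses.
node_accelerator_type = (
    GKELabelFormatter.canonicalize('A100'))
```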

Comment on lines +813 to +817
if (len(machine_type_parts) != 3 or
not machine_type_parts[0].startswith('ct') or
machine_type_parts[1] != 'hightpu' or
not machine_type_parts[2].endswith('t') or
not machine_type_parts[2].strip('t').isdigit()):
Collaborator


Hmm, this is a bit hard to read. Would it be possible to just use a regex for this?

Collaborator Author


Indeed, I got a regex working.
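Such a regex might look like the following (a sketch only; the exact pattern merged in the PR may differ). It collapses the multi-part string checks quoted above into one expression matching GKE TPU machine types like `ct5l-hightpu-1t`:

```python
import re

# Matches e.g. 'ct5l-hightpu-1t' and captures the per-node chip count,
# mirroring the startswith('ct') / 'hightpu' / endswith('t') checks above.
_TPU_MACHINE_TYPE_RE = re.compile(r'^ct[a-z0-9]+-hightpu-(\d+)t$')

def tpu_chip_count(machine_type: str):
    """Return the per-node TPU chip count, or None if not a TPU type."""
    match = _TPU_MACHINE_TYPE_RE.match(machine_type)
    return int(match.group(1)) if match else None
```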

@SeungjinYang SeungjinYang mentioned this pull request Mar 19, 2025
1 task
SeungjinYang added a commit that referenced this pull request Mar 19, 2025
* review comments

* use .get() where it makes sense
SalikovAlex pushed a commit to SalikovAlex/skypilot that referenced this pull request Mar 20, 2025
…ypilot-org#4935)

* working codepath

* remove prints and an assert

* make into classes

* minor changes

* update codepath comment

* lint

* slight reformat

* review feedback

* autoscale_detecror -> autoscaler

* unnest regions_with_offering logic

* short circuit on unsupported autoscaler

* formalize context name validation, add exception handling for cluster info request

* account for TPUs

* code hardening

* remove AUTOSCALER_TO_LABEL_FORMATTER in favor of expanding AUTOSCALER_TYPE_TO_AUTOSCALER

* more debug logs, review feedbacks

* final review comments addressed
SalikovAlex pushed a commit to SalikovAlex/skypilot that referenced this pull request Mar 20, 2025
* review comments

* use .get() where it makes sense
cblmemo added a commit that referenced this pull request Mar 20, 2025
* Add Nebius storage integration and associated utilities

This commit introduces support for Nebius object storage, enabling users to integrate Nebius storage with various functionalities such as file mounting, syncing, and cloud transfers. It includes necessary adaptors, utility methods, and updates to the storage framework to handle Nebius-specific configurations and endpoints.

* Add support for Nebius storage mounting and testing

Implemented new Nebius storage mounting functionality, including COPY and MOUNT modes, with tests. Updated cloud_stores.py for AWS CLI compatibility and added a YAML template for Nebius storage configurations. Removed outdated Nebius test case in favor of the new approach.

* format

* Add Nebius object storage support across tests and utilities

This commit introduces comprehensive Nebius support, making it accessible for S3-compatible operations including bucket creation, deletion, and mounting. It removes reliance on the AWS_SHARED_CREDENTIALS_FILE environment variable, streamlines Nebius-specific configurations, and adds associated unit test parameters to validate functionality across storage operations.

* fix

* typo

* Refactor Nebius adaptor and improve clarity.

Remove redundant code, streamline imports, and enhance error messaging. Adjust documentation for better accuracy and update function annotations. These changes improve maintainability and readability of the Nebius adaptor module.

* Refactor Nebius storage setup and clean up debug print.

Simplified Nebius AWS CLI installation by reusing the S3CloudStorage configuration for consistency. Removed unnecessary debug print in `run_upload_cli` to reduce console noise. Minor formatting adjustment in test YAML file.

* Refactor Nebius storage handling and add timeout for deletions

Clean up and improve code readability, including string formatting and conditionals. Introduce `_TIMEOUT_TO_PROPAGATES` to handle timeout while verifying Nebius bucket deletions. Update comments to reflect the corrected usage on Nebius servers.

* Refactor subprocess call and improve timeout error messaging.

Removed unused variable from subprocess call to clean up code. Updated timeout error to include the bucket name for more detailed and helpful error reporting.

* Set default region for Nebius Object Storage if none provided

Updated the helper method to assign a default region when no region is specified, ensuring compatibility with Nebius Object Storage. This change avoids potential errors caused by missing region values.

* Support Nebius URLs in file sync commands

Replace 'nebius://' with 's3://' in source paths to ensure compatibility with AWS CLI commands. This allows seamless integration of Nebius storage endpoints.

* [Docs] Add quick start to k8s getting started docs (#4799)

* k8s quick start

* title

* [Docs] New "Examples" section (#4858)

* WIP: Examples dropdown.

* update new

* WIP

* local render is fine; need to add files

* test pip

* fix

* add missing

* add missing

* add missing

* add missing

* try .mddmissing

* add missing

* updates

* updates

* Instructions

* updates

* cleanup

* fix

* updates

* lint

* updates

* add RAG

* RAG new

* add missing

* refactor

* fix redirection and warnings

* generate before build

* remove uneccessary source

* minor

* remove generated examples

* fix header

* priorize readme file

* avoid remove

* format

* update README

* update links

* try fix stem/name

* add paper

* updates

* update task -> skypilot yaml

* source/generate_examples.py: revert to .stem

---------

Co-authored-by: Zhanghao Wu <[email protected]>

* [API Server] Fix admin policy enforcement on `validate` and `optimize` (#4820)

* Add admin policy to validate

* Add admin policy to validate

* Add admin policy to optimize

* docs

* imports

* Move dag validation to core

* Fixes

* lint

* Add comments

* lint

* Fixed executor based validate implementation

* Revert executor based validate implementation

* lint

* lint

* Add validation during optimize

* lint

* Remove validate from core

* Remove admin policy apply when validating dag for exec

* comments

* Bump API version

* comments

* [Core] Exit with non-zero code on launch/exec/logs/jobs launch/jobs logs (#4846)

* Support return code based on job success/failure

* Return exit code for tailing managed jobs

* Fixes

* lint

* Create JobExitCode enum

* Get JobExitCode from ManagedJobStatus

* lint

* cleanup

* cleanup

* Add tests

* lint

* Managed jobs back compat

* Skylet backward compatibility

* lint

* Update logs --status returncodes

* Update logs --status returncodes

* lint

* fix retcode

* Fix tests

* lint

* Fix --no-follow

* Fix cli docs rendering

* minor

* rename ret_code to returncode

* rename SUCCESS to SUCCEEDED

* Refactor JobExitCode to exceptions

* lint

* [Storage] Fix storage deletion for all (#4872)

Fix storage deletion for all

* [Docs] Avoid back links in FAQ (#4866)

Avoid back links

* Serve log before termination for smoke tests (#4691)

* serve log before termination

* restore change

* replace command

* fix

* add sky serve status

* [Dashboard] Fix Log Download (#4844)

* download preview

* refactor log content column

* fix column issue

* [jobs] catch NotSupportedError for `sky down --purge` (#4811)

Fixes #4626.

* [Test] fixed managed job return code with --no-follow for compatibility test (#4887)

* [Test] fixed backward compatibility test

Signed-off-by: Aylei <[email protected]>

* lint

Signed-off-by: Aylei <[email protected]>

* temp test

Signed-off-by: Aylei <[email protected]>

* revert temp change

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>

* show managed jobs user column in `sky status -u` (#4889)

* [Examples] Rename airflow DAG (#4898)

* Rename to sky_train_dag

* rename

* [API server] honor SKYPILOT_DEBUG env in server log (#4883)

Signed-off-by: Aylei <[email protected]>

* [jobs] resolve jobs queue user on API server side (#4897)

* [jobs] resolve jobs queue user on API server side

* lint

* note user_name is optional

* Updates the vast catalog to write directly to the vms.csv (#4891)

Previously this file emitted to sys.stdout which prevented
the catalog-fetcher from actually updating the catalog.

This has now been updated matching many of the patterns
employed by other vendors in this directory.

* [Docs] Minor updates to installation.rst (#4888)

* [Docs] K8s docs updates (#4902)

Fixes to k8s docs

* [jobs] fix dashboard for remote API server (#4895)

* [jobs] fix dashboard for remote API server

* fix for k8s

* [docs] add jobs controller resource tuning reference in config page (#4909)

* [Core] Handle mid-sequence chunking in log streaming (#4908)

* Handle mid-sequence chunking

* format

* Handle actual UnicodeDecodeError

* lint

* Exclude `.pyc` and `__pycache__` files from config hash calculation to fix `test_launch_fast --kubernetes` failures (#4880)

* filter out pyc and pycache

* filter out pyc and pycache

* handle edge case

* None for the case where file might have been deleted after listing

* add comment

* [Docs] Add a few more examples for k8s. (#4911)

* Add some new Example links.

* Finetune landing/README.

* Updates

* No fork button

* [Docs] Add team deployment in existing machine and `detach_run` in docs (#4913)

* Indicate remote API server for jobs

* Add api deployment and detach_run in docs

* avoid console for better copy paste

* avoid more console

* fix

* rename

* update doc

* format

* revert

* Update docs/source/reservations/existing-machines.rst

Co-authored-by: Zongheng Yang <[email protected]>

---------

Co-authored-by: Zongheng Yang <[email protected]>

* update PR template to use CI tests (#4917)

* update template

* update

* no bold

* [UX] Auto-exclude unavailable kubernetes contexts (#4692)

* [UX] Exclude stale kubernetes context

- Improve Kubernetes context and node retrieval error handling
- Add context-aware retry mechanism for Kubernetes API calls

* catch broad error

Signed-off-by: Aylei <[email protected]>

* track unavailable contexts

Signed-off-by: Aylei <[email protected]>

* typo

Signed-off-by: Aylei <[email protected]>

* remove irrelevant change

Signed-off-by: Aylei <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

* Update sky/clouds/kubernetes.py

Co-authored-by: Zhanghao Wu <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

* cover unreachable context in smoke test

Signed-off-by: Aylei <[email protected]>

* cover unreachable context in smoke test

Signed-off-by: Aylei <[email protected]>

* fix post cleanup in multi-k8s

Signed-off-by: Aylei <[email protected]>

* more comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>

* [API server] accelerate start by slowly start workers (#4885)

* [API server] accelerate start by slowly start workers

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

* always close

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>

* more permissive match for k8s accelerators (#4925)

* case insensitive match for k8s accelerators

* fix typo in canonicalization func

* format

* [core] if not all nodes are in ray status, double check after 5s (#4916)

* [core] if not all nodes are in ray status, double check after 5s

* add a comment explaining the situation more
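The double-check described above avoids treating a momentarily incomplete `ray status` as a failure: if fewer nodes than expected appear, wait briefly and query once more before concluding. A hedged sketch, with `fetch` standing in for the real status query (the names and 5-second default are assumptions from the commit title):

```python
import time


def wait_for_all_nodes(expected, fetch, recheck_delay=5.0):
    """Return True if `fetch()` eventually reports >= `expected` nodes.

    If the first query comes up short, wait `recheck_delay` seconds and
    query once more, since nodes may still be registering with ray.
    """
    nodes = fetch()
    if len(nodes) < expected:
        time.sleep(recheck_delay)
        nodes = fetch()
    return len(nodes) >= expected
```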

* [Docs] Remove `networking: nodeport` from config docs (#4928)

Remove `networking: nodeport` from config

* [Core] Fix failover handler for clouds moved to new provisioner (#4919)

* Fix failover handler

* remove unused handler

* [Test] Cost down smoke tests (#4813)

* change cpu to 2+ and memory to 4+

* remove some resource heavy

* update yaml

* intermediate bucket yaml

* cloud aws for test_managed_jobs_pipeline_recovery_aws

* pipeline yaml update

* fix

* fix

* larger the size of kube

* resource heavy

* test skyserve_update

* test skyserve_update

* fix kubernetes test failure

* skyserve_streaming

* more kubernetes high resource test

* restore azure of test_skyserve_rolling_update

* restore azure change

* restore change

* restore test_skyserve_rolling_update

* bug fix

* fix yaml

* v100 does not require low resource

* no special resource for kubernetes tests

* add more for master test

* test_multi_tenant_managed_jobs low resource

* managed_job_storage

* longer timeout for kube

* resolve PR comment

* rename function

* Add linting for sentence case in Markdown and reST headings (#4805)

* linting

* subtitle

* draft linting

* update linting script

* title lowercase

* fix

* pass build

* simplified logic

* resolve review comment

* resolve review comment

* restore change

* resolve comment

* [Core] sky exec now waits for cluster to be started (#4867)

* [Core] sky exec now waits for cluster to be started

Signed-off-by: Aylei <[email protected]>

* add smoke test case

Signed-off-by: Aylei <[email protected]>

* refine smoke

Signed-off-by: Aylei <[email protected]>

* fix smoke test

Signed-off-by: Aylei <[email protected]>

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>

* [Docs] Minor: pull up a page. (#4929)

* Fix Nebius integration issues and update storage error message

Updated the `create_endpoint` function to ensure the `region` parameter is strictly typed as `str`. Modified `create_nebius_client` to accept `None` as the default region. Additionally, corrected the error message in `storage.py` to specify 'nebius' instead of 's3'.

* typo

* Refactor storage handling and update R2 credentials usage

Updated R2 command to explicitly set AWS_SHARED_CREDENTIALS_FILE for better credential management. Simplified region assignment logic in storage initialization to improve code readability and maintainability.

* Refactor Nebius-related code for clarity and correctness

Ensure Nebius paths are properly validated and transformed, replacing `if` checks with assertions. Fixed default region handling in `create_endpoint` and corrected variable naming in `split_nebius_path` for consistency. These changes enhance code reliability and maintainability.

* Refactor SDK initialization to use a cached global instance.

Introduce a global `_sdk` variable to cache the SDK instance, preventing redundant initialization. This improves efficiency by avoiding repeated calls to `nebius.sdk.SDK()` in the `sdk()` function. The logic ensures `_sdk` is only initialized once, either with IAM credentials or a credentials file.

* Update bucket URI format in mount and storage test

Replaced the `bucket_uri` returned in the test with a prefixed `nebius://` format. This ensures consistency with updated storage access conventions.

* format

* [Jobs][UX] add -all option to jobs queue printing (#4923)

* add all option

* formatting

* fix comments

* Refactor jobs queue display logic and improve job listing

* [deps] pin ibm-platform-services to >=0.48.0 to work around issue (#4939)

* [api server] avoid deleting requests.db but not -wal/-shm (#4941)

[api server] avoid deleting requests.db without -wal/-shm

* [Test] Fix kubernetes failure tests (#4874)

* resource_heavy for test_multi_tenant_managed_jobs

* longer initial delay and resource_heavy

* test launch fast

* test again

* more test

* more log

* more log

* more log

* more log

* more log

* restore log

* remove resource heavy

* restore change

* longer initial delay

* wait for NOT_READY for test_skyserve_rolling_update test

* remove unused import

* increase the sleep to 120

* f format

* fix test_managed_jobs_storage and test_kubernetes_storage_mounts

* restore deleted test

* restore more

* remove resource_heavy

* test

* test again

* fix azure check

* test one more time

* test one more time

* Revert "test one more time"

This reverts commit 029a3a7.

* Revert "test one more time"

This reverts commit fa70b8f.

* Revert "test again"

This reverts commit 3480116.

* Revert "test"

This reverts commit c695b56.

* fix

* add comment

* no spot for kubernetes test

* no spot

* bigger initial delay

* longer initial delay

* check if its eks cluster

* fix bool arg

* [k8s] filter out nodes with less accelerators than requested (#4930)

* filter out nodes in gke with less accelerators than requested

* address comments

* gpu check executes on non-tpu nodes
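Filtering out nodes with fewer accelerators than requested (so a 1-GPU node is never considered for an 8-GPU request) amounts to comparing each node's allocatable accelerator count against the request. A minimal sketch, assuming nodes are represented as a name-to-resources mapping and the standard `nvidia.com/gpu` resource label; neither is SkyPilot's actual data structure.

```python
def filter_nodes_by_accel(nodes, requested_count,
                          count_label='nvidia.com/gpu'):
    """Keep only nodes whose allocatable accelerator count meets the
    request. `nodes` maps node name -> allocatable resources dict.

    Illustrative sketch; the real check also handles TPUs and other
    resource labels.
    """
    return {
        name: res for name, res in nodes.items()
        if int(res.get(count_label, 0)) >= requested_count
    }
```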

* [Jobs] Error out for intermediate bucket on cloud not enabled (#4942)

* Error out for intermediate bucket on cloud not enabled

* better logging for reauth error

* Add reauth exception

* format

* [Docs] Add docs on implementing priorities in k8s (#4803)

* Add priorities page

* Address comments, add to k8s setup docs

* fixes

* Fixes

* [Docs] Minor wording changes. (#4940)

* wip

* updates

* reword

* add

* [Examples] LLM/Gemma3 Example  (#4937)

* gemma3

* Update gemma3.yaml to specify exact versions for transformers and vllm installations; add readiness probe configuration in service section. Update README.md to correct command option from 'deepseek' to 'gemma-3'.

* Remove outdated command option from README.md for clarity.

* update readme for serving

* Update README.md to correct the number of nodes in the service launch command and update the link to the appropriate SkyPilot YAML file.

* [Doc] Gemma3 doc update (#4948)

Update README and documentation to include Gemma 3 model and example. Added Gemma 3 to the news section in README.md and updated the models index in documentation.

* [Docs] Update k8s volume mounting docs + refactor optional steps (#4934)

* Update volume mounting docs

* Update volume mounting docs

* Update volume mounting docs

* Add nested tabset, restructure optional steps

* Move volume mounting docs

* Update volume mounting docs

* Reorder

* casing

* Comments

* fix

* reduce links

* [Test] Simplified buildkite agent queue (#4932)

* remove serve and backcompat

* ignore buildkite yaml file

* [Docs] Update benefits for client-server (#4945)

* Update benefits for client-server

* update

* Update docs/source/reference/api-server/api-server.rst

Co-authored-by: Zongheng Yang <[email protected]>

---------

Co-authored-by: Zongheng Yang <[email protected]>

* Fix flaky for test_cancel_launch_and_exec_async (#4966)

* fix flaky for test_cancel_launch_and_exec_async

* comma

* use generic_cloud

* new line format

* [Docs] fix typo in gemma3 example (#4971)

Signed-off-by: Aylei <[email protected]>

* [k8s] better support for GKE scale-to-zero autoscaling node pools (#4935)

* working codepath

* remove prints and an assert

* make into classes

* minor changes

* update codepath comment

* lint

* slight reformat

* review feedback

* autoscale_detecror -> autoscaler

* unnest regions_with_offering logic

* short circuit on unsupported autoscaler

* formalize context name validation, add exception handling for cluster info request

* account for TPUs

* code hardening

* remove AUTOSCALER_TO_LABEL_FORMATTER in favor of expanding AUTOSCALER_TYPE_TO_AUTOSCALER

* more debug logs, review feedbacks

* final review comments addressed

* fix incorrect vcpu/mem checks for GKE autoscaler (#4972)

* fix annotation "kubernetes.io/ingress.class" is deprecated (#4974)

* fix annotation "kubernetes.io/ingress.class" is deprecated

Signed-off-by: Ajay-Satish-01 <[email protected]>

* fix: ingress spec based on version

---------

Signed-off-by: Ajay-Satish-01 <[email protected]>

* [UX] Fix dense cli for resources not enough (#4962)

fix dense cli

* [API server] attach setup of controllers (#4931)

* [API server] attach setup of controllers

Signed-off-by: Aylei <[email protected]>

* lint

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>

* [Test] Add support for missing bashrc file in zsh shells (#4963)

* add support for zsh

* fix for bashrc after testing

* [Docs] Fix NFS mounting docs for k8s (#4951)

Add kubernetes key

* [k8s] GKE support for TPU V6 (#4986)

* [k8s] GKE support for TPU V6

* gke t6 support

* remove wrong check

* Fix test_managed_jobs_storage failure on azure in master branch (#4965)

* fix

* longer timeout

* [Test] Separate different params into different steps on Buildkite and fix flakiness of test_job_queue_with_docker (#4955)

* different param to different steps

* longer time to sleep

* [API server] cleanup executor processes on shutdown (#4912)

* [API server] cleanup executor on shutdown

Signed-off-by: Aylei <[email protected]>

* refine

Signed-off-by: Aylei <[email protected]>

* just raise impossible exceptions

Signed-off-by: Aylei <[email protected]>

* Update sky/utils/subprocess_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>

* [k8s] LRU cache for GKE can_create_new_instance_of_type (#4973)

* LRU cache for can_create_new_instance_of_type

* request scope

* [Test]Refactor backward compatibility test (#4906)

* backward compat

* fix

* backcompat update

* generate pipeline

* bug fix

* remove deactivate

* robust backcompat test

* fix

* more log

* bug fix

* subprocess run with bash

* bug fix

* update template

* fix flaky

* limit concurrency

* pip install uv

* fix

* low resource

* fix

* bump python version to 3.10

* recreate env

* import order

* [Core] Independent storage check (#4977)

* independent storage check

* formatting

* granular perms

* _is_storage_cloud_enabled uses storage check

* UX improvement

* remove debug logs

* fix local test

* sky check no regression

* no sky check regression, managed jobs work

* api backwards compatibility

* define globally minimal perms for gcp

* review feedback

* continue from except

* [GCP] Don't require TPU support for serve:gcp if TPU support is not required (#4991)

don't require tpu support for serve:gcp if tpu support is not required

* followup to #4935 (#4989)

* review comments

* use .get() where it makes sense

* [Serve] BugFix: `any_of` field order issue causes version bump to not work (#4978)

* [Serve] BugFix: `any_of` field order issue causes version bump to not work

* upd

* [Example] Batch Inference  (#4994)

* initial code for batched inference

* Refactor batch inference scripts and configuration files for improved consistency and clarity. Removed unused bucket name generation and monitoring service launch from `batch_compute_vectors.py`. Updated `compute_text_vectors.yaml` and `monitor_progress.yaml` to use a unified bucket name. Revised README to reflect changes in embedding generation focus and performance highlights.

* formattting

* Update batch inference configuration files for consistency. Renamed `compute_text_vectors` and `monitor_progress` to include `batch-inference` prefix. Revised README to enhance clarity on monitoring progress and accessing the service.

* Update README for batch inference: replaced local image links with external URLs, corrected endpoint variable in monitoring instructions, and added a new image for enhanced visual representation.

* Update README for batch inference: corrected image URLs to include file extensions and added a new section for further learning resources.

* Enhance README for batch inference: updated the section on computing embeddings to include details about the Amazon reviews dataset and clarified the use of the `Alibaba-NLP/gte-Qwen2-7B-instruct` model for generating embeddings.

* update banner

---------

Signed-off-by: Aylei <[email protected]>
Signed-off-by: Ajay-Satish-01 <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: zpoint <[email protected]>
Co-authored-by: Kaiyuan Eric Chen <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Aylei <[email protected]>
Co-authored-by: chris mckenzie <[email protected]>
Co-authored-by: Seung Jin <[email protected]>
Co-authored-by: Ajay Satish <[email protected]>
Co-authored-by: Daniel Shin <[email protected]>
Co-authored-by: Tian Xia <[email protected]>