
[k8s] better support for GKE scale-to-zero autoscaling node pools #4935


Merged
merged 17 commits into master from k8s-gke-autoscaler on Mar 17, 2025

Conversation

SeungjinYang
Collaborator

@SeungjinYang SeungjinYang commented Mar 12, 2025

Addresses #4875

Currently, if a node autoscaler is configured on a k8s cluster, the only thing skypilot knows about the autoscaler is the configuration provided by the user. In particular, skypilot has no way to tell whether a node pool that has simply been autoscaled to zero could provide the node type needed to handle a job. Currently, skypilot gets around this by submitting a pod to each context with the autoscaler enabled and checking whether the pod is scheduled before a timeout.

While this approach is functional, it is inefficient because:

  • A context (= cluster) may have an autoscaling node pool, but that node pool may not provide the VM needed to satisfy the request. For example, there may be an autoscaler on a node pool with A100 GPU VMs - skypilot doesn't know this, only that there is an autoscaler group, and will try to launch H100 resources on it.
  • The autoscaling node pool may have the correct accelerator type but a different accelerator count, or different CPU/memory constraints. For example, a node pool that spins up VMs with 1 A100 cannot handle launch requests for A100:8, but again skypilot doesn't know that.
  • If there are multiple allowed contexts and only some of them have autoscalers, there is no way for skypilot to know which. So skypilot may try to schedule a pod on a context without an autoscaler that cannot schedule that pod.
    ^ note on the above: the k8s autoscaler configuration is global, not per-context. A per-context autoscaler config could also solve this specific bullet point.

This PR solves these challenges for the GKE autoscaler specifically. It does so by querying each context for its node pools, detecting whether any node pool has autoscaling configured, and checking whether a node that satisfies the job request can be spun up.

Assumptions in code:

  • Context name follows the convention gke_PROJECT-ID_ZONE_CLUSTER-NAME. If not, skypilot falls back to the legacy codepath.
  • The customer has GCP auth set up for skypilot to query GKE cluster details. If not, skypilot falls back to the legacy codepath.
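Under these assumptions, the overall per-context check can be sketched roughly as follows (a hedged sketch, not SkyPilot's actual internals; `fetch_gke_cluster` and `pool_can_fit` are hypothetical stand-ins passed in by the caller):

```python
# Rough sketch of the flow added in this PR: parse the context name, look
# the cluster up in GCP, and see whether any autoscaling-enabled node pool
# could fit the requested instance type. Any failure falls back to the
# legacy optimistic behavior (return True and let the pod-submit probe
# decide). Helper names are illustrative.
def context_may_autoscale_for(context: str, instance_type: str,
                              fetch_gke_cluster, pool_can_fit) -> bool:
    parts = context.split('_')
    if len(parts) != 4 or parts[0] != 'gke':
        return True  # not the standard GKE context format: legacy codepath
    project_id, location, cluster_name = parts[1:]
    try:
        cluster = fetch_gke_cluster(project_id, location, cluster_name)
    except Exception:
        return True  # no GCP auth / lookup failed: legacy codepath
    return any(
        (pool.get('autoscaling') or {}).get('enabled', False) and
        pool_can_fit(instance_type, pool)
        for pool in cluster.get('nodePools', []))
```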

Testing

  • Known failure case: skypilot[gcp] is not installed
% sky launch --gpus tpu-v5-lite-device:1 --cpus 32 --cloud kubernetes
Could not fetch autoscaler information from GKE. Run pip install "skypilot[gcp]" for more intelligent pod scheduling with GKE autoscaler.
Considered resources (1 node):
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                             vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE                                           COST ($)   CHOSEN   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   32CPU--128GB--tpu-v5-lite-device:1   32      128       tpu-v5-lite-device:1   gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔     
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-56a0-seungjinyang'. Proceed? [Y/n]:
...
% sky launch --gpus a100:1 --cloud kubernetes                                                               
Could not fetch autoscaler information from GKE. Run pip install "skypilot[gcp]" for more intelligent pod scheduling with GKE autoscaler.
Considered resources (1 node):
---------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                           COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--A100:1   2       8         A100:1         gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔     
---------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-e17b-seungjinyang'. Proceed? [Y/n]: 

As seen, skypilot still attempts to schedule pods requesting GPUs that are not in the autoscaled node pool; this is consistent with legacy code behavior.

  • Known failure case: context name does not follow the standard GKE context format.
    Tested on a GKE cluster with a node pool of ct5l-hightpu-1t (exposing 1xtpu-v5-lite-device per node).
% kubectx test=.
Context "gke_<project>_us-central1-a_skypilot-test-cluster" renamed to "test".
% sky launch --gpus tpu-v5-lite-device:1 --cpus 32 --cloud kubernetes
Considered resources (1 node):
------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                             vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE   COST ($)   CHOSEN   
------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   32CPU--128GB--tpu-v5-lite-device:1   32      128       tpu-v5-lite-device:1   test          0.00          ✔     
------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-becc-seungjinyang'. Proceed? [Y/n]: 
...
% sky launch --gpus a100:1 --cloud kubernetes              
Considered resources (1 node):
-----------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
-----------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--A100:1   2       8         A100:1         test          0.00          ✔     
-----------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-0301-seungjinyang'. Proceed? [Y/n]: 

As seen, skypilot still attempts to schedule pods requesting GPUs that are not in the autoscaled node pool; this is consistent with legacy code behavior.

  • Test newly introduced functionality
    Tested on a GKE cluster with a node pool of ct5l-hightpu-1t (exposing 1xtpu-v5-lite-device per node).
% sky launch --gpus tpu-v5-lite-device:4 --cloud kubernetes
No resource satisfying Kubernetes({'tpu-v5-lite-device': 4}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-device': 4}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
% sky launch --gpus a100:1 --cloud kubernetes
No resource satisfying Kubernetes({'A100': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 1}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
% sky launch --gpus tpu-v5-lite-device:1 --cloud kubernetes
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                          vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE                                           COST ($)   CHOSEN   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--tpu-v5-lite-device:1   2       8         tpu-v5-lite-device:1   <project>_us-central1-a_skypilot-test-cluster   0.00          ✔     
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-169e-seungjinyang'. Proceed? [Y/n]:

As seen, skypilot only proceeds to schedule a pod on the cluster if the correct accelerator is present.

Caveat
Context: the A100 node deployed on the cluster has 12 vCPUs and 85 GB of memory.

% sky launch --gpus a100:1 --cpus 12 --memory 85  --cloud kubernetes
Considered resources (1 node):
-----------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE              vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                           COST ($)   CHOSEN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   12CPU--85GB--A100:1   12      85        A100:1         gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔ 
-----------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-6d90-seungjinyang'. Proceed? [Y/n]:
Aborted!
% sky launch --gpus a100:1 --cpus 13 --memory 85 --cloud kubernetes
No resource satisfying Kubernetes(cpus=13, mem=85, {'A100': 1}) on Kubernetes.
Try specifying a different CPU count, or add "+" to the end of the CPU count to allow for larger instances.
Try specifying a different memory size, or add "+" to the end of the memory size to allow for larger instances.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes(cpus=13, mem=85, {'A100': 1}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
% sky launch --gpus a100:1 --cpus 12 --memory 86 --cloud kubernetes
No resource satisfying Kubernetes(cpus=12, mem=86, {'A100': 1}) on Kubernetes.
Try specifying a different CPU count, or add "+" to the end of the CPU count to allow for larger instances.
Try specifying a different memory size, or add "+" to the end of the memory size to allow for larger instances.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes(cpus=12, mem=86, {'A100': 1}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

As seen here, skypilot blocks requests on an autoscaling node pool with GPUs when the requested CPU or memory exceeds what the node provides.

However, this is not true for requests on autoscaling node pools with TPUs.
Context: the vt TPU node deployed on the cluster has 24 vCPUs and 48 GB of memory.

% sky launch --gpus tpu-v5-lite-device:1 --cpus 24 --memory 48 --cloud kubernetes
Considered resources (1 node):
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                            vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE                                           COST ($)   CHOSEN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   24CPU--48GB--tpu-v5-lite-device:1   24      48        tpu-v5-lite-device:1   gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-b28a-seungjinyang'. Proceed? [Y/n]: n
Aborted!
% sky launch --gpus tpu-v5-lite-device:1 --cpus 999 --memory 999 --cloud kubernetes
Considered resources (1 node):
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                              vCPUs   Mem(GB)   ACCELERATORS           REGION/ZONE                                           COST ($)   CHOSEN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   999CPU--999GB--tpu-v5-lite-device:1   999     999       tpu-v5-lite-device:1   gke_<project>_us-central1-a_skypilot-test-cluster   0.00          ✔
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-5732-seungjinyang'. Proceed? [Y/n]: n

Requests that specify more CPUs/memory than the TPU node provides still go through.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (see above)
  • Relevant individual tests: /smoke-test --kubernetes
  • Backward compatibility: /quicktest-core

Collaborator Author

@SeungjinYang SeungjinYang left a comment


Moving review comments to new locations after the last commit

@SeungjinYang SeungjinYang force-pushed the k8s-gke-autoscaler branch 3 times, most recently from 40839da to c702207 Compare March 12, 2025 21:56
@SeungjinYang SeungjinYang force-pushed the k8s-gke-autoscaler branch 2 times, most recently from 9505cf6 to 41001c4 Compare March 13, 2025 00:18
@SeungjinYang SeungjinYang force-pushed the k8s-gke-autoscaler branch 3 times, most recently from 6151f03 to 0d999db Compare March 14, 2025 06:38
@SeungjinYang SeungjinYang self-assigned this Mar 14, 2025
@SeungjinYang SeungjinYang requested a review from cg505 March 14, 2025 18:40
@SeungjinYang SeungjinYang marked this pull request as ready for review March 14, 2025 19:40
@SeungjinYang SeungjinYang force-pushed the k8s-gke-autoscaler branch 3 times, most recently from e917c66 to d7de1aa Compare March 14, 2025 19:55
@SeungjinYang
Collaborator Author

/quicktest-core

Collaborator

@cg505 cg505 left a comment


Since there are a lot of conditionals/logic paths, I think more debugging logs (ideally at almost every branch) would be super helpful in the future when we have to debug some customer issues with this.

@SeungjinYang
Collaborator Author

I've added a good bit of debug logs based on my experience developing this PR.

@SeungjinYang
Collaborator Author

quicktest-core and smoke-test --kubernetes are passing as of this commit

Collaborator

@cg505 cg505 left a comment


Looks good now! Left two minor comments you can address before merging.

@SeungjinYang SeungjinYang enabled auto-merge (squash) March 17, 2025 17:39
@SeungjinYang SeungjinYang merged commit 9e4f222 into master Mar 17, 2025
18 checks passed
@SeungjinYang SeungjinYang deleted the k8s-gke-autoscaler branch March 17, 2025 17:46
Collaborator

@Michaelvll Michaelvll left a comment


Thanks @SeungjinYang and @cg505 for getting this in! I am leaving some comments for some minor issues. : )


# This variable is stored in memory in the server.
# The variable will reset if the server restarts.
_pip_install_gcp_hint_last_sent = 0.0
Collaborator


This won't persist across multiple invocations of SkyPilot. Wondering what the purpose of this is?

Collaborator Author


This works empirically (i.e., the timeout is respected across multiple sky invocations), except for the first invocation after `sky api stop; sky api start`. This is because GKEAutoscaler is instantiated on the server side, and this variable is not reset unless the server is reset.
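The behavior discussed above can be sketched as a module-level rate limiter living in the server process (a sketch only; the names `_HINT_INTERVAL_SECONDS` and `maybe_emit_gcp_hint` are illustrative, not SkyPilot's):

```python
import time

# Because the timestamp is module-level state in the long-lived server
# process, the hint fires at most once per interval across client
# invocations, and the limiter resets only when the server restarts.
_HINT_INTERVAL_SECONDS = 60 * 10
_hint_last_sent = 0.0

def maybe_emit_gcp_hint(log=print) -> bool:
    """Emit the hint if the interval has elapsed; return whether it fired."""
    global _hint_last_sent
    now = time.time()
    if now - _hint_last_sent < _HINT_INTERVAL_SECONDS:
        return False
    _hint_last_sent = now
    log('Run pip install "skypilot[gcp]" for more intelligent '
        'pod scheduling with GKE autoscaler.')
    return True
```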

Comment on lines +658 to +670
for node_pool in cluster['nodePools']:
    logger.debug(f'checking if node pool {node_pool["name"]} '
                 'has autoscaling enabled.')
    if (node_pool['autoscaling'] is not None and
            'enabled' in node_pool['autoscaling'] and
            node_pool['autoscaling']['enabled']):
        logger.debug(
            f'node pool {node_pool["name"]} has autoscaling enabled. '
            'Checking if it can create a node '
            f'satisfying {instance_type}')
        if cls._check_instance_fits_gke_autoscaler_node_pool(
                instance_type, node_pool):
            return True
Collaborator


Just curious: is there any chance we will get a KeyError from these calls? Since the autoscaler check is an optional optimization, we may want to make sure it does not fail in any unexpected cases.

Collaborator Author


This should never raise a KeyError, but I know better than to rely on "should". I'll execute this code block inside a try/except on KeyError and fall back to scheduling pods optimistically if one is raised.
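The defensive pattern described above might look like this (a minimal sketch under the assumption that the fit-check is best-effort; `can_create_node`, `node_pools`, and `check_fits` are hypothetical stand-ins):

```python
# Treat the autoscaler fit-check as an optional optimization: if the GKE
# API response is missing an expected key, fall back to optimistic
# scheduling rather than failing the whole launch.
def can_create_node(node_pools, instance_type, check_fits) -> bool:
    try:
        for node_pool in node_pools:
            autoscaling = node_pool['autoscaling']  # may raise KeyError
            if autoscaling is not None and autoscaling.get('enabled'):
                if check_fits(instance_type, node_pool):
                    return True
        return False
    except KeyError:
        # Unexpected response shape: schedule optimistically.
        return True
```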

Comment on lines +674 to +692
def _validate_context_name(cls, context: str) -> Tuple[bool, str, str, str]:
    """Validates the context name is in the format of
    gke_PROJECT-ID_LOCATION_CLUSTER-NAME
    Returns:
        bool: True if the context name is in the format of
            gke_PROJECT-ID_LOCATION_CLUSTER-NAME
        str: project id
        str: location
        str: cluster name
    """
    context_components = context.split('_')
    if len(context_components) != 4 or context_components[0] != 'gke':
        logger.debug(
            f'context {context} is not in valid GKE context format.')
        return False, '', '', ''

    logger.debug(f'context {context} is in valid GKE context format.')
    return True, context_components[1], context_components[
        2], context_components[3]
Collaborator


Checking the context name is not a very robust solution, since a user can always change the context name manually. Is there a better way to check if a cluster is on GKE?

Collaborator


This also relates to the robustness comment above. If a context name is set to start with gke_, will it cause the KeyError issue above?

Collaborator Author

@SeungjinYang SeungjinYang Mar 19, 2025


We use the parsed context information to query the GCP backend for the cluster info. If the context name passes the parser but isn't actually a valid GKE context, the cluster lookup against GCP will fail. In that case we fall back to optimistically scheduling pods.

to fit the instance type.
"""
for accelerator in node_pool_accelerators:
node_accelerator_type = GKELabelFormatter. \
Collaborator


minor: please avoid using `\` for line continuation; use parentheses instead
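For illustration, the reviewer's suggestion amounts to the following (a generic sketch; the class body here is a dummy stand-in, not the real GKELabelFormatter):

```python
# Dummy stand-in, just to make the example self-contained.
class GKELabelFormatter:
    @staticmethod
    def canonicalize(value: str) -> str:
        return value.lower()

# Discouraged: trailing-backslash continuation.
# node_accelerator_type = GKELabelFormatter. \
#     canonicalize('A100')

# Preferred: implicit continuation inside parentheses.
node_accelerator_type = (
    GKELabelFormatter.canonicalize('A100'))
```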

Comment on lines +813 to +817
if (len(machine_type_parts) != 3 or
not machine_type_parts[0].startswith('ct') or
machine_type_parts[1] != 'hightpu' or
not machine_type_parts[2].endswith('t') or
not machine_type_parts[2].strip('t').isdigit()):
Collaborator


Hmm, this is a bit hard to read. Would it be possible to just use a regex for this?

Collaborator Author


Indeed, I got a regex working.
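Such a regex might look like the following (a sketch only; the exact pattern merged in the PR may differ). It collapses the multi-part string checks quoted above into one expression matching GKE TPU machine types like `ct5l-hightpu-1t`:

```python
import re

# Matches e.g. 'ct5l-hightpu-1t' and captures the per-node chip count,
# mirroring the startswith('ct') / 'hightpu' / endswith('t') checks above.
_TPU_MACHINE_TYPE_RE = re.compile(r'^ct[a-z0-9]+-hightpu-(\d+)t$')

def tpu_chip_count(machine_type: str):
    """Return the per-node TPU chip count, or None if not a TPU type."""
    match = _TPU_MACHINE_TYPE_RE.match(machine_type)
    return int(match.group(1)) if match else None
```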

@SeungjinYang SeungjinYang mentioned this pull request Mar 19, 2025
1 task
SeungjinYang added a commit that referenced this pull request Mar 19, 2025
* review comments

* use .get() where it makes sense
SalikovAlex pushed a commit to SalikovAlex/skypilot that referenced this pull request Mar 20, 2025
…ypilot-org#4935)

* working codepath

* remove prints and an assert

* make into classes

* minor changes

* update codepath comment

* lint

* slight reformat

* review feedback

* autoscale_detecror -> autoscaler

* unnest regions_with_offering logic

* short circuit on unsupported autoscaler

* formalize context name validation, add exception handling for cluster info request

* account for TPUs

* code hardening

* remove AUTOSCALER_TO_LABEL_FORMATTER in favor of expanding AUTOSCALER_TYPE_TO_AUTOSCALER

* more debug logs, review feedbacks

* final review comments addressed
SalikovAlex pushed a commit to SalikovAlex/skypilot that referenced this pull request Mar 20, 2025
* review comments

* use .get() where it makes sense
cblmemo added a commit that referenced this pull request Mar 20, 2025
* Add Nebius storage integration and associated utilities

This commit introduces support for Nebius object storage, enabling users to integrate Nebius storage with various functionalities such as file mounting, syncing, and cloud transfers. It includes necessary adaptors, utility methods, and updates to the storage framework to handle Nebius-specific configurations and endpoints.

* Add support for Nebius storage mounting and testing

Implemented new Nebius storage mounting functionality, including COPY and MOUNT modes, with tests. Updated cloud_stores.py for AWS CLI compatibility and added a YAML template for Nebius storage configurations. Removed outdated Nebius test case in favor of the new approach.

* format

* Add Nebius object storage support across tests and utilities

This commit introduces comprehensive Nebius support, making it accessible for S3-compatible operations including bucket creation, deletion, and mounting. It removes reliance on the AWS_SHARED_CREDENTIALS_FILE environment variable, streamlines Nebius-specific configurations, and adds associated unit test parameters to validate functionality across storage operations.

* fix

* typo

* Refactor Nebius adaptor and improve clarity.

Remove redundant code, streamline imports, and enhance error messaging. Adjust documentation for better accuracy and update function annotations. These changes improve maintainability and readability of the Nebius adaptor module.

* Refactor Nebius storage setup and clean up debug print.

Simplified Nebius AWS CLI installation by reusing the S3CloudStorage configuration for consistency. Removed unnecessary debug print in `run_upload_cli` to reduce console noise. Minor formatting adjustment in test YAML file.

* Refactor Nebius storage handling and add timeout for deletions

Clean up and improve code readability, including string formatting and conditionals. Introduce `_TIMEOUT_TO_PROPAGATES` to handle timeout while verifying Nebius bucket deletions. Update comments to reflect the corrected usage on Nebius servers.

* Refactor subprocess call and improve timeout error messaging.

Removed unused variable from subprocess call to clean up code. Updated timeout error to include the bucket name for more detailed and helpful error reporting.

* Set default region for Nebius Object Storage if none provided

Updated the helper method to assign a default region when no region is specified, ensuring compatibility with Nebius Object Storage. This change avoids potential errors caused by missing region values.

* Support Nebius URLs in file sync commands

Replace 'nebius://' with 's3://' in source paths to ensure compatibility with AWS CLI commands. This allows seamless integration of Nebius storage endpoints.

* [Docs] Add quick start to k8s getting started docs (#4799)

* k8s quick start

* title

* [Docs] New "Examples" section (#4858)

* WIP: Examples dropdown.

* update new

* WIP

* local render is fine; need to add files

* test pip

* fix

* add missing

* add missing

* add missing

* add missing

* try .mddmissing

* add missing

* updates

* updates

* Instructions

* updates

* cleanup

* fix

* updates

* lint

* updates

* add RAG

* RAG new

* add missing

* refactor

* fix redirection and warnings

* generate before build

* remove uneccessary source

* minor

* remove generated examples

* fix header

* priorize readme file

* avoid remove

* format

* update README

* update links

* try fix stem/name

* add paper

* updates

* update task -> skypilot yaml

* source/generate_examples.py: revert to .stem

---------

Co-authored-by: Zhanghao Wu <[email protected]>

* [API Server] Fix admin policy enforcement on `validate` and `optimize` (#4820)

* Add admin policy to validate

* Add admin policy to validate

* Add admin policy to optimize

* docs

* imports

* Move dag validation to core

* Fixes

* lint

* Add comments

* lint

* Fixed executor based validate implementation

* Revert executor based validate implementation

* lint

* lint

* Add validation during optimize

* lint

* Remove validate from core

* Remove admin policy apply when validating dag for exec

* comments

* Bump API version

* comments

* [Core] Exit with non-zero code on launch/exec/logs/jobs launch/jobs logs (#4846)

* Support return code based on job success/failure

* Return exit code for tailing managed jobs

* Fixes

* lint

* Create JobExitCode enum

* Get JobExitCode from ManagedJobStatus

* lint

* cleanup

* cleanup

* Add tests

* lint

* Managed jobs back compat

* Skylet backward compatibility

* lint

* Update logs --status returncodes

* Update logs --status returncodes

* lint

* fix retcode

* Fix tests

* lint

* Fix --no-follow

* Fix cli docs rendering

* minor

* rename ret_code to returncode

* rename SUCCESS to SUCCEEDED

* Refactor JobExitCode to exceptions

* lint

* [Storage] Fix storage deletion for all (#4872)

Fix storage deletion for all

* [Docs] Avoid back links in FAQ (#4866)

Avoid back links

* Serve log before termination for smoke tests (#4691)

* serve log before termination

* restore change

* replace command

* fix

* add sky serve status

* [Dashboard] Fix Log Download (#4844)

* download preview

* refactor log content column

* fix column issue

* [jobs] catch NotSupportedError for `sky down --purge` (#4811)

Fixes #4626.

* [Test] fixed managed job return code with --no-follow for compatibility test (#4887)

* [Test] fixed backward compatibility test

Signed-off-by: Aylei <[email protected]>

* lint

Signed-off-by: Aylei <[email protected]>

* temp test

Signed-off-by: Aylei <[email protected]>

* revert temp change

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>

* show managed jobs user column in `sky status -u` (#4889)

* [Examples] Rename airflow DAG (#4898)

* Rename to sky_train_dag

* rename

* [API server] honor SKYPILOT_DEBUG env in server log (#4883)

Signed-off-by: Aylei <[email protected]>

* [jobs] resolve jobs queue user on API server side (#4897)

* [jobs] resolve jobs queue user on API server side

* lint

* note user_name is optional

* Updates the vast catalog to write directly to the vms.csv (#4891)

Previously this file emitted to sys.stdout which prevented
the catalog-fetcher from actually updating the catalog.

This has now been updated matching many of the patterns
employed by other vendors in this directory.

* [Docs] Minor updates to installation.rst (#4888)

* [Docs] K8s docs updates (#4902)

Fixes to k8s docs

* [jobs] fix dashboard for remote API server (#4895)

* [jobs] fix dashboard for remote API server

* fix for k8s

* [docs] add jobs controller resource tuning reference in config page (#4909)

* [Core] Handle mid-sequence chunking in log streaming (#4908)

* Handle mid-sequence chunking

* format

* Handle actual UnicodeDecodeError

* lint

* Exclude `.pyc` and `__pycache__` files from config hash calculation to fix `test_launch_fast --kubernetes` failures (#4880)

* filter out pyc and pycache

* filter out pyc and pycache

* handle edge case

* None for the case where file might have been deleted after listing

* add comment

* [Docs] Add a few more examples for k8s. (#4911)

* Add some new Example links.

* Finetune landing/README.

* Updates

* No fork button

* [Docs] Add team deployment in existing machine and `detach_run` in docs (#4913)

* Indicate remote API server for jobs

* Add api deployment and detach_run in docs

* avoid console for better copy paste

* avoid more console

* fix

* rename

* update doc

* format

* revert

* Update docs/source/reservations/existing-machines.rst

Co-authored-by: Zongheng Yang <[email protected]>

---------

Co-authored-by: Zongheng Yang <[email protected]>

* update PR template to use CI tests (#4917)

* update template

* update

* no bold

* [UX] Auto-exclude unavailable kubernetes contexts (#4692)

* [UX] Exclude stale kubernetes context

- Improve Kubernetes context and node retrieval error handling
- Add context-aware retry mechanism for Kubernetes API calls

* catch broad error

Signed-off-by: Aylei <[email protected]>

* track unavailable contexts

Signed-off-by: Aylei <[email protected]>

* typo

Signed-off-by: Aylei <[email protected]>

* remove irrelevant change

Signed-off-by: Aylei <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

* Update sky/clouds/kubernetes.py

Co-authored-by: Zhanghao Wu <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

* cover unreachable context in smoke test

Signed-off-by: Aylei <[email protected]>

* cover unreachable context in smoke test

Signed-off-by: Aylei <[email protected]>

* fix post cleanup in multi-k8s

Signed-off-by: Aylei <[email protected]>

* more comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>

* [API server] accelerate start by slowly start workers (#4885)

* [API server] accelerate start by slowly start workers

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

* always close

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>

* more permissive match for k8s accelerators (#4925)

* case insensitive match for k8s accelerators

* fix typo in canonicalization func

* format

* [core] if not all nodes are in ray status, double check after 5s (#4916)

* [core] if not all nodes are in ray status, double check after 5s

* add a comment explaining the situation more
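The double-check described above avoids treating a momentarily incomplete `ray status` as a failure: if fewer nodes than expected appear, wait briefly and query once more before concluding. A hedged sketch, with `fetch` standing in for the real status query (the names and 5-second default are assumptions from the commit title):

```python
import time


def wait_for_all_nodes(expected, fetch, recheck_delay=5.0):
    """Return True if `fetch()` eventually reports >= `expected` nodes.

    If the first query comes up short, wait `recheck_delay` seconds and
    query once more, since nodes may still be registering with ray.
    """
    nodes = fetch()
    if len(nodes) < expected:
        time.sleep(recheck_delay)
        nodes = fetch()
    return len(nodes) >= expected
```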

* [Docs] Remove `networking: nodeport` from config docs (#4928)

Remove `networking: nodeport` from config

* [Core] Fix failover handler for clouds moved to new provisioner (#4919)

* Fix failover handler

* remove unused handler

* [Test] Cost down smoke tests (#4813)

* change cpu to 2+ and memory to 4+

* remove some resource heavy

* update yaml

* intermediate bucket yaml

* cloud aws for test_managed_jobs_pipeline_recovery_aws

* pipeline yaml update

* fix

* fix

* larger the size of kube

* resource heavy

* test skyserve_update

* test skyserve_update

* fix kubernetes test failure

* skyserve_streaming

* more kubernetes high resource test

* restore azure of test_skyserve_rolling_update

* restore azure change

* restore change

* restore test_skyserve_rolling_update

* bug fix

* fix yaml

* v100 does not require low resource

* no special resource for kubernetes tests

* add more for master test

* test_multi_tenant_managed_jobs low resource

* managed_job_storage

* longer timeout for kube

* resolve PR comment

* rename function

* Add linting for sentence case in Markdown and reST headings (#4805)

* linting

* subtitle

* draft linting

* update linting script

* title lowercase

* fix

* pass build

* simplified logic

* resolve review comment

* resolve review comment

* restore change

* resolve comment

* [Core] sky exec now waits for cluster to be started (#4867)

* [Core] sky exec now waits for cluster to be started

Signed-off-by: Aylei <[email protected]>

* add smoke test case

Signed-off-by: Aylei <[email protected]>

* refine smoke

Signed-off-by: Aylei <[email protected]>

* fix smoke test

Signed-off-by: Aylei <[email protected]>

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>

* [Docs] Minor: pull up a page. (#4929)

* Fix Nebius integration issues and update storage error message

Updated the `create_endpoint` function to ensure the `region` parameter is strictly typed as `str`. Modified `create_nebius_client` to accept `None` as the default region. Additionally, corrected the error message in `storage.py` to specify 'nebius' instead of 's3'.

* typo

* Refactor storage handling and update R2 credentials usage

Updated R2 command to explicitly set AWS_SHARED_CREDENTIALS_FILE for better credential management. Simplified region assignment logic in storage initialization to improve code readability and maintainability.

* Refactor Nebius-related code for clarity and correctness

Ensure Nebius paths are properly validated and transformed, replacing `if` checks with assertions. Fixed default region handling in `create_endpoint` and corrected variable naming in `split_nebius_path` for consistency. These changes enhance code reliability and maintainability.

* Refactor SDK initialization to use a cached global instance.

Introduce a global `_sdk` variable to cache the SDK instance, preventing redundant initialization. This improves efficiency by avoiding repeated calls to `nebius.sdk.SDK()` in the `sdk()` function. The logic ensures `_sdk` is only initialized once, either with IAM credentials or a credentials file.

* Update bucket URI format in mount and storage test

Replaced the `bucket_uri` returned in the test with a prefixed `nebius://` format. This ensures consistency with updated storage access conventions.

* format

* [Jobs][UX] add -all option to jobs queue printing (#4923)

* add all option

* formatting

* fix comments

* Refactor jobs queue display logic and improve job listing

* [deps] pin ibm-platform-services to >=0.48.0 to work around issue (#4939)

* [api server] avoid deleting requests.db but not -wal/-shm (#4941)

[api server] avoid deleting requests.db without -wal/-shm

* [Test] Fix kubernetes failure tests (#4874)

* resource_heavy for test_multi_tenant_managed_jobs

* longer initial delay and resource_heavy

* test launch fast

* test again

* more test

* more log

* more log

* more log

* more log

* more log

* restore log

* remove resource heavy

* restore change

* longer initial delay

* wait for NOT_READY for test_skyserve_rolling_update test

* remove unused import

* increase the sleep to 120

* f format

* fix test_managed_jobs_storage and test_kubernetes_storage_mounts

* restore deleted test

* restore more

* remove resource_heavy

* test

* test again

* fix azure check

* test one more time

* test one more time

* Revert "test one more time"

This reverts commit 029a3a7.

* Revert "test one more time"

This reverts commit fa70b8f.

* Revert "test again"

This reverts commit 3480116.

* Revert "test"

This reverts commit c695b56.

* fix

* add comment

* no spot for kubernetes test

* no spot

* bigger initial delay

* longer initial delay

* check if its eks cluster

* fix bool arg

* [k8s] filter out nodes with less accelerators than requested (#4930)

* filter out nodes in gke with less accelerators than requested

* address comments

* gpu check executes on non-tpu nodes
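Filtering out nodes with fewer accelerators than requested (so a 1-GPU node is never considered for an 8-GPU request) amounts to comparing each node's allocatable accelerator count against the request. A minimal sketch, assuming nodes are represented as a name-to-resources mapping and the standard `nvidia.com/gpu` resource label; neither is SkyPilot's actual data structure.

```python
def filter_nodes_by_accel(nodes, requested_count,
                          count_label='nvidia.com/gpu'):
    """Keep only nodes whose allocatable accelerator count meets the
    request. `nodes` maps node name -> allocatable resources dict.

    Illustrative sketch; the real check also handles TPUs and other
    resource labels.
    """
    return {
        name: res for name, res in nodes.items()
        if int(res.get(count_label, 0)) >= requested_count
    }
```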

* [Jobs] Error out for intermediate bucket on cloud not enabled (#4942)

* Error out for intermediate bucket on cloud not enabled

* better logging for reauth error

* Add reauth exception

* format

* [Docs] Add docs on implementing priorities in k8s (#4803)

* Add priorities page

* Address comments, add to k8s setup docs

* fixes

* Fixes

* [Docs] Minor wording changes. (#4940)

* wip

* updates

* reword

* add

* [Examples] LLM/Gemma3 Example  (#4937)

* gemma3

* Update gemma3.yaml to specify exact versions for transformers and vllm installations; add readiness probe configuration in service section. Update README.md to correct command option from 'deepseek' to 'gemma-3'.

* Remove outdated command option from README.md for clarity.

* update readme for serving

* Update README.md to correct the number of nodes in the service launch command and update the link to the appropriate SkyPilot YAML file.

* [Doc] Gemma3 doc update (#4948)

Update README and documentation to include Gemma 3 model and example. Added Gemma 3 to the news section in README.md and updated the models index in documentation.

* [Docs] Update k8s volume mounting docs + refactor optional steps (#4934)

* Update volume mounting docs

* Update volume mounting docs

* Update volume mounting docs

* Add nested tabset, restructure optional steps

* Move volume mounting docs

* Update volume mounting docs

* Reorder

* casing

* Comments

* fix

* reduce links

* [Test] Simplified buildkite agent queue (#4932)

* remove serve and backcompat

* ignore buildkite yaml file

* [Docs] Update benefits for client-server (#4945)

* Update benefits for client-server

* update

* Update docs/source/reference/api-server/api-server.rst

Co-authored-by: Zongheng Yang <[email protected]>

---------

Co-authored-by: Zongheng Yang <[email protected]>

* Fix flaky for test_cancel_launch_and_exec_async (#4966)

* fix flaky for test_cancel_launch_and_exec_async

* comma

* use generic_cloud

* new line format

* [Docs] fix typo in gemma3 example (#4971)

Signed-off-by: Aylei <[email protected]>

* [k8s] better support for GKE scale-to-zero autoscaling node pools (#4935)

* working codepath

* remove prints and an assert

* make into classes

* minor changes

* update codepath comment

* lint

* slight reformat

* review feedback

* autoscale_detecror -> autoscaler

* unnest regions_with_offering logic

* short circuit on unsupported autoscaler

* formalize context name validation, add exception handling for cluster info request

* account for TPUs

* code hardening

* remove AUTOSCALER_TO_LABEL_FORMATTER in favor of expanding AUTOSCALER_TYPE_TO_AUTOSCALER

* more debug logs, review feedbacks

* final review comments addressed

* fix incorrect vcpu/mem checks for GKE autoscaler (#4972)

* fix annotation "kubernetes.io/ingress.class" is deprecated (#4974)

* fix annotation "kubernetes.io/ingress.class" is deprecated

Signed-off-by: Ajay-Satish-01 <[email protected]>

* fix: ingress spec based on version

---------

Signed-off-by: Ajay-Satish-01 <[email protected]>

* [UX] Fix dense cli for resources not enough (#4962)

fix dense cli

* [API server] attach setup of controllers (#4931)

* [API server] attach setup of controllers

Signed-off-by: Aylei <[email protected]>

* lint

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>

* [Test] Add support for missing bashrc file in zsh shells (#4963)

* add support for zsh

* fix for bashrc after testing

* [Docs] Fix NFS mounting docs for k8s (#4951)

Add kubernetes key

* [k8s] GKE support for TPU V6 (#4986)

* [k8s] GKE support for TPU V6

* gke t6 support

* remove wrong check

* Fix test_managed_jobs_storage failure on azure in master branch (#4965)

* fix

* longer timeout

* [Test] Separate different params into different steps on Buildkite and fix flakiness of test_job_queue_with_docker (#4955)

* different param to different steps

* longer time to sleep

* [API server] cleanup executor processes on shutdown (#4912)

* [API server] cleanup executor on shutdown

Signed-off-by: Aylei <[email protected]>

* refine

Signed-off-by: Aylei <[email protected]>

* just raise impossible exceptions

Signed-off-by: Aylei <[email protected]>

* Update sky/utils/subprocess_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>

* [k8s] LRU cache for GKE can_create_new_instance_of_type (#4973)

* LRU cache for can_create_new_instance_of_type

* request scope

* [Test]Refactor backward compatibility test (#4906)

* backward compat

* fix

* backcompat update

* generate pipeline

* bug fix

* remove deactivate

* robust backcompat test

* fix

* more log

* bug fix

* subprocess run with bash

* bug fix

* update template

* fix flaky

* limit concurrency

* pip install uv

* fix

* low resource

* fix

* bump python version to 3.10

* recreate env

* import order

* [Core] Independent storage check (#4977)

* independent storage check

* formatting

* granular perms

* _is_storage_cloud_enabled uses storage check

* UX improvement

* remove debug logs

* fix local test

* sky check no regression

* no sky check regression, managed jobs work

* api backwards compatibility

* define globally minimal perms for gcp

* review feedback

* continue from except

* [GCP] Don't require TPU support for serve:gcp if TPU support is not required (#4991)

don't require tpu support for serve:gcp if tpu support is not required

* followup to #4935 (#4989)

* review comments

* use .get() where it makes sense

* [Serve] BugFix: `any_of` field order issue causes version bump to not work (#4978)

* [Serve] BugFix: `any_of` field order issue causes version bump to not work

* upd

* [Example] Batch Inference  (#4994)

* initial code for batched inference

* Refactor batch inference scripts and configuration files for improved consistency and clarity. Removed unused bucket name generation and monitoring service launch from `batch_compute_vectors.py`. Updated `compute_text_vectors.yaml` and `monitor_progress.yaml` to use a unified bucket name. Revised README to reflect changes in embedding generation focus and performance highlights.

* formattting

* Update batch inference configuration files for consistency. Renamed `compute_text_vectors` and `monitor_progress` to include `batch-inference` prefix. Revised README to enhance clarity on monitoring progress and accessing the service.

* Update README for batch inference: replaced local image links with external URLs, corrected endpoint variable in monitoring instructions, and added a new image for enhanced visual representation.

* Update README for batch inference: corrected image URLs to include file extensions and added a new section for further learning resources.

* Enhance README for batch inference: updated the section on computing embeddings to include details about the Amazon reviews dataset and clarified the use of the `Alibaba-NLP/gte-Qwen2-7B-instruct` model for generating embeddings.

* update banner

---------

Signed-off-by: Aylei <[email protected]>
Signed-off-by: Ajay-Satish-01 <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: zpoint <[email protected]>
Co-authored-by: Kaiyuan Eric Chen <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Aylei <[email protected]>
Co-authored-by: chris mckenzie <[email protected]>
Co-authored-by: Seung Jin <[email protected]>
Co-authored-by: Ajay Satish <[email protected]>
Co-authored-by: Daniel Shin <[email protected]>
Co-authored-by: Tian Xia <[email protected]>