[UX] Fix dense cli for resources not enough #4962
Merged
@Michaelvll this is ready for review!
SeungjinYang
approved these changes
Mar 17, 2025
Thanks! This looks good.
SalikovAlex
pushed a commit
to SalikovAlex/skypilot
that referenced
this pull request
Mar 20, 2025
cblmemo
added a commit
that referenced
this pull request
Mar 20, 2025
* Add Nebius storage integration and associated utilities This commit introduces support for Nebius object storage, enabling users to integrate Nebius storage with various functionalities such as file mounting, syncing, and cloud transfers. It includes necessary adaptors, utility methods, and updates to the storage framework to handle Nebius-specific configurations and endpoints. * Add support for Nebius storage mounting and testing Implemented new Nebius storage mounting functionality, including COPY and MOUNT modes, with tests. Updated cloud_stores.py for AWS CLI compatibility and added a YAML template for Nebius storage configurations. Removed outdated Nebius test case in favor of the new approach. * format * Add Nebius object storage support across tests and utilities This commit introduces comprehensive Nebius support, making it accessible for S3-compatible operations including bucket creation, deletion, and mounting. It removes reliance on the AWS_SHARED_CREDENTIALS_FILE environment variable, streamlines Nebius-specific configurations, and adds associated unit test parameters to validate functionality across storage operations. * fix * typo * Refactor Nebius adaptor and improve clarity. Remove redundant code, streamline imports, and enhance error messaging. Adjust documentation for better accuracy and update function annotations. These changes improve maintainability and readability of the Nebius adaptor module. * Refactor Nebius storage setup and clean up debug print. Simplified Nebius AWS CLI installation by reusing the S3CloudStorage configuration for consistency. Removed unnecessary debug print in `run_upload_cli` to reduce console noise. Minor formatting adjustment in test YAML file. * Refactor Nebius storage handling and add timeout for deletions Clean up and improve code readability, including string formatting and conditionals. Introduce `_TIMEOUT_TO_PROPAGATES` to handle timeout while verifying Nebius bucket deletions. 
Update comments to reflect the corrected usage on Nebius servers. * Refactor subprocess call and improve timeout error messaging. Removed unused variable from subprocess call to clean up code. Updated timeout error to include the bucket name for more detailed and helpful error reporting. * Set default region for Nebius Object Storage if none provided Updated the helper method to assign a default region when no region is specified, ensuring compatibility with Nebius Object Storage. This change avoids potential errors caused by missing region values. * Support Nebius URLs in file sync commands Replace 'nebius://' with 's3://' in source paths to ensure compatibility with AWS CLI commands. This allows seamless integration of Nebius storage endpoints. * [Docs] Add quick start to k8s getting started docs (#4799) * k8s quick start * title * [Docs] New "Examples" section (#4858) * WIP: Examples dropdown. * update new * WIP * local render is fine; need to add files * test pip * fix * add missing * add missing * add missing * add missing * try .mddmissing * add missing * updates * updates * Instructions * updates * cleanup * fix * updates * lint * updates * add RAG * RAG new * add missing * refactor * fix redirection and warnings * generate before build * remove uneccessary source * minor * remove generated examples * fix header * priorize readme file * avoid remove * format * update README * update links * try fix stem/name * add paper * updates * update task -> skypilot yaml * source/generate_examples.py: revert to .stem --------- Co-authored-by: Zhanghao Wu <[email protected]> * [API Server] Fix admin policy enforcement on `validate` and `optimize` (#4820) * Add admin policy to validate * Add admin policy to validate * Add admin policy to optimize * docs * imports * Move dag validation to core * Fixes * lint * Add comments * lint * Fixed executor based validate implementation * Revert executor based validate implementation * lint * lint * Add validation during optimize * 
lint * Remove validate from core * Remove admin policy apply when validating dag for exec * comments * Bump API version * comments * [Core] Exit with non-zero code on launch/exec/logs/jobs launch/jobs logs (#4846) * Support return code based on job success/failure * Return exit code for tailing managed jobs * Fixes * lint * Create JobExitCode enum * Get JobExitCode from ManagedJobStatus * lint * cleanup * cleanup * Add tests * lint * Managed jobs back compat * Skylet backward compatibility * lint * Update logs --status returncodes * Update logs --status returncodes * lint * fix retcode * Fix tests * lint * Fix --no-follow * Fix cli docs rendering * minor * rename ret_code to returncode * rename SUCCESS to SUCCEEDED * Refactor JobExitCode to exceptions * lint * [Storage] Fix storage deletion for all (#4872) Fix storage deletion for all * [Docs] Avoid back links in FAQ (#4866) Avoid back links * Serve log before termination for smoke tests (#4691) * serve log before termination * restore change * replace command * fix * add sky serve status * [Dashboard] Fix Log Download (#4844) * download preview * refactor log content column * fix column issue * [jobs] catch NotSupportedError for `sky down --purge` (#4811) Fixes #4626. 
* [Test] fixed managed job return code with --no-follow for compatibility test (#4887) * [Test] fixed backward compatibility test Signed-off-by: Aylei <[email protected]> * lint Signed-off-by: Aylei <[email protected]> * temp test Signed-off-by: Aylei <[email protected]> * revert temp change Signed-off-by: Aylei <[email protected]> --------- Signed-off-by: Aylei <[email protected]> * show managed jobs user column in `sky status -u` (#4889) * [Examples] Rename airflow DAG (#4898) * Rename to sky_train_dag * rename * [API server] honor SKYPILOT_DEBUG env in server log (#4883) Signed-off-by: Aylei <[email protected]> * [jobs] resolve jobs queue user on API server side (#4897) * [jobs] resolve jobs queue user on API server side * lint * note user_name is optional * Updates the vast catalog to write directly to the vms.csv (#4891) Previously this file emitted to sys.stdout which prevented the catalog-fetcher from actually updating the catalog. This has now been updated matching many of the patterns employed by other vendors in this directory. * [Docs] Minor updates to installation.rst (#4888) * [Docs] K8s docs updates (#4902) Fixes to k8s docs * [jobs] fix dashboard for remote API server (#4895) * [jobs] fix dashboard for remote API server * fix for k8s * [docs] add jobs controller resource tuning reference in config page (#4909) * [Core] Handle mid-sequence chunking in log streaming (#4908) * Handle mid-sequence chunking * format * Handle actual UnicodeDecodeError * lint * Exclude `.pyc` and `__pycache__` files from config hash calculation to fix `test_launch_fast --kubernetes` failures (#4880) * filter out pyc and pycache * filter out pyc and pycache * handle edge case * None for the case where file might have been deleted after listing * add comment * [Docs] Add a few more examples for k8s. (#4911) * Add some new Example links. * Finetune landing/README. 
* Updates * No fork button * [Docs] Add team deployment in existing machine and `detach_run` in docs (#4913) * Indicate remote API server for jobs * Add api deployment and detach_run in docs * avoid console for better copy paste * avoid more console * fix * rename * update doc * format * revert * Update docs/source/reservations/existing-machines.rst Co-authored-by: Zongheng Yang <[email protected]> --------- Co-authored-by: Zongheng Yang <[email protected]> * update PR template to use CI tests (#4917) * update template * update * no bold * [UX] Auto-exclude unavailable kubernetes contexts (#4692) * [UX] Exclude stale kubernetes context - Improve Kubernetes context and node retrieval error handling - Add context-aware retry mechanism for Kubernetes API calls * catch broad error Signed-off-by: Aylei <[email protected]> * track unavailable contexts Signed-off-by: Aylei <[email protected]> * typo Signed-off-by: Aylei <[email protected]> * remove irrelevant change Signed-off-by: Aylei <[email protected]> * address review comments Signed-off-by: Aylei <[email protected]> * Update sky/clouds/kubernetes.py Co-authored-by: Zhanghao Wu <[email protected]> * address review comments Signed-off-by: Aylei <[email protected]> * address review comments Signed-off-by: Aylei <[email protected]> * cover unreachable context in smoke test Signed-off-by: Aylei <[email protected]> * cover unreachable context in smoke test Signed-off-by: Aylei <[email protected]> * fix post cleanup in multi-k8s Signed-off-by: Aylei <[email protected]> * more comments Signed-off-by: Aylei <[email protected]> --------- Signed-off-by: Aylei <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]> * [API server] accelerate start by slowly start workers (#4885) * [API server] accelerate start by slowly start workers Signed-off-by: Aylei <[email protected]> * Address review comments Signed-off-by: Aylei <[email protected]> * always close Signed-off-by: Aylei <[email protected]> * Address review 
comments Signed-off-by: Aylei <[email protected]> --------- Signed-off-by: Aylei <[email protected]> * more permissive match for k8s accelerators (#4925) * case insensitive match for k8s accelerators * fix typo in canonicalization func * format * [core] if not all nodes are in ray status, double check after 5s (#4916) * [core] if not all nodes are in ray status, double check after 5s * add a comment explaining the situation more * [Docs] Remove `networking: nodeport` from config docs (#4928) Remove `networking: nodeport` from config * [Core] Fix failover handler for clouds moved to new provisioner (#4919) * Fix failover handler * remove unused handler * [Test] Cost down smoke tests (#4813) * change cpu to 2+ and memory to 4+ * remove some resource heavy * update yaml * intermediate bucket yaml * cloud aws for test_managed_jobs_pipeline_recovery_aws * pipeline yaml update * fix * fix * larger the size of kube * resource heavy * test skyserve_update * test skyserve_update * fix kubernetes test failure * skyserve_streaming * more kubernetes high resource test * restore azure of test_skyserve_rolling_update * restore azure change * restore change * restore test_skyserve_rolling_update * bug fix: * fix yaml * v100 does not require low resource * no special resource for kubernetes tests * add more for master test * test_multi_tenant_managed_jobs low resource * managed_job_storage * longer timeout for kube * resolve PR comment * rename function * Add linting for sentence case in Markdown and reST headings (#4805) * linting * subtitle * draft linting * update linting script * title lowercase * fix * pass build * simplified logic * resolve review comment * resolve review comment * restore change * resolve comment * [Core] sky exec now waits cluster to be started (#4867) * [Core] sky exec now waits cluster to be started Signed-off-by: Aylei <[email protected]> * add smoke test case Signed-off-by: Aylei <[email protected]> * refine smoke Signed-off-by: Aylei <[email 
protected]> * fix smoke test Signed-off-by: Aylei <[email protected]> * Apply suggestions from code review Co-authored-by: Zhanghao Wu <[email protected]> * address review comments Signed-off-by: Aylei <[email protected]> * Address review comments Signed-off-by: Aylei <[email protected]> --------- Signed-off-by: Aylei <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]> * [Docs] Minor: pull up a page. (#4929) * `Fix Nebius integration issues and update storage error message` Updated the `create_endpoint` function to ensure the `region` parameter is strictly typed as `str`. Modified `create_nebius_client` to accept `None` as the default region. Additionally, corrected the error message in `storage.py` to specify 'nebius' instead of 's3'. * typo * Refactor storage handling and update R2 credentials usage Updated R2 command to explicitly set AWS_SHARED_CREDENTIALS_FILE for better credential management. Simplified region assignment logic in storage initialization to improve code readability and maintainability. * Refactor Nebius-related code for clarity and correctness Ensure Nebius paths are properly validated and transformed, replacing `if` checks with assertions. Fixed default region handling in `create_endpoint` and corrected variable naming in `split_nebius_path` for consistency. These changes enhance code reliability and maintainability. * Refactor SDK initialization to use a cached global instance. Introduce a global `_sdk` variable to cache the SDK instance, preventing redundant initialization. This improves efficiency by avoiding repeated calls to `nebius.sdk.SDK()` in the `sdk()` function. The logic ensures `_sdk` is only initialized once, either with IAM credentials or a credentials file. * Update bucket URI format in mount and storage test Replaced the `bucket_uri` returned in the test with a prefixed `nebius://` format. This ensures consistency with updated storage access conventions. 
* format * [Jobs][UX] add -all option to jobs queue printing (#4923) * add all option * formatting * fix comments * Refactor jobs queue display logic and improve job listing * [deps] pin ibm-platform-services to >=0.48.0 to work around issue (#4939) * [api server] avoid deleting requests.db but not -wal/-shm (#4941) [api server] avoid deleting requests.db without -wal/-shm * [Test] Fix kubernetes failure tests (#4874) * resource_heavy for test_multi_tenant_managed_jobs * longer initial delay and resource_heavy * test launch fast * test again * more test * more log * more log * more log * more log * more log * restore log * remove resource heavy * restore change * longer initial delay * wait for NOT_READY for test_skyserve_rolling_update test * remove unuse import * increase the sleep to 120 * f format * fix test_managed_jobs_storage and test_kubernetes_storage_mounts * restore deleted test * restore more * remove resource_heavy * test * test again * fix azure check * test one more time * test one more time * Revert "test one more time" This reverts commit 029a3a7. * Revert "test one more time" This reverts commit fa70b8f. * Revert "test again" This reverts commit 3480116. * Revert "test" This reverts commit c695b56. * fix * add comment * no spot for kubernetes test * no spot * bigger initial delay * longer initial delay * check if its eks cluster * fix bool arg * [k8s] filter out nodes with less accelerators than requested (#4930) * filter out nodes in gke with less accelerators than requested * address comments * gpu check executes on non-tpu nodes * [Jobs] Error out for intermediate bucket on cloud not enabled (#4942) * Error out for intermediate bucket on cloud not enabled * better logging for reauth error * Add reauth exception * format * [Docs] Add docs on implementing priorities in k8s (#4803) * Add priorities page * Address comments, add to k8s setup docs * fixes * Fixes * [Docs] Minor wording changes. 
(#4940) * wip * updates * reword * add * [Examples] LLM/Gemma3 Example (#4937) * gemma3 * Update gemma3.yaml to specify exact versions for transformers and vllm installations; add readiness probe configuration in service section. Update README.md to correct command option from 'deepseek' to 'gemma-3'. * Remove outdated command option from README.md for clarity. * update readme for serving * Update README.md to correct the number of nodes in the service launch command and update the link to the appropriate SkyPilot YAML file. * [Doc] Gemma3 doc update (#4948) Update README and documentation to include Gemma 3 model and example. Added Gemma 3 to the news section in README.md and updated the models index in documentation. * [Docs] Update k8s volume mounting docs + refactor optional steps (#4934) * Update volume mounting docs * Update volume mounting docs * Update volume mounting docs * Add nested tabset, restructure optional steps * Move volume mounting docs * Update volume mounting docs * Reorder * casing * Comments * fix * reduce links * [Test] Simplified buildkite agent queue (#4932) * remove serve and backcompact * ignore buildkite yaml file * [Docs] Update benefits for client-server (#4945) * Update benefits for client-server * update * Update docs/source/reference/api-server/api-server.rst Co-authored-by: Zongheng Yang <[email protected]> --------- Co-authored-by: Zongheng Yang <[email protected]> * Fix flaky for test_cancel_launch_and_exec_async (#4966) * fix flaky for test_cancel_launch_and_exec_async * comma * use generic_cloud * new line format * [Docs] fix typo in gemma3 example (#4971) Signed-off-by: Aylei <[email protected]> * [k8s] better support for GKE scale-to-zero autoscaling node pools (#4935) * working codepath * remove prints and an assert * make into classes * minor changes * update codepath comment * lint * slight reformat * review feedback * autoscale_detecror -> autoscaler * unnest regions_with_offering logic * short circuit on unsupported 
autoscaler * formalize context name validation, add exception handling for cluster info request * account for TPUs * code hardening * remove AUTOSCALER_TO_LABEL_FORMATTER in favor of expanding AUTOSCALER_TYPE_TO_AUTOSCALER * more debug logs, review feedbacks * final review comments addressed * fix incorrect vcpu/mem checks for GKE autoscaler (#4972) * fix annotation "kubernetes.io/ingress.class" is deprecated (#4974) * fix annotation "kubernetes.io/ingress.class" is deprecated Signed-off-by: Ajay-Satish-01 <[email protected]> * fix: ingress spec based on version --------- Signed-off-by: Ajay-Satish-01 <[email protected]> * [UX] Fix dense cli for resources not enough (#4962) fix dense cli * [API server] attach setup of controllers (#4931) * [API server] attach setup of controllers Signed-off-by: Aylei <[email protected]> * lint Signed-off-by: Aylei <[email protected]> * Address review comments Signed-off-by: Aylei <[email protected]> --------- Signed-off-by: Aylei <[email protected]> * [Test] Add support for missing bashrc file in zsh shells (#4963) * add support for zsh * fix for bashrc after testing * [Docs] Fix NFS mounting docs for k8s (#4951) Add kubernetes key * [k8s] GKE support for TPU V6 (#4986) * [k8s] GKE support for TPU V6 * gke t6 support * remove wrong check * Fix test_managed_jobs_storage failure on azure in master branch (#4965) * fix * longer timeout * [Test]Separate different param into different steps on buildkite and fix flacky of test_job_queue_with_docker (#4955) * different param to different steps * longer time to sleep * [API server] cleanup executor processes on shutdown (#4912) * [API server] cleanup executor on shutdown Signed-off-by: Aylei <[email protected]> * refine Signed-off-by: Aylei <[email protected]> * just raise impossible exceptions Signed-off-by: Aylei <[email protected]> * Update sky/utils/subprocess_utils.py Co-authored-by: Zhanghao Wu <[email protected]> --------- Signed-off-by: Aylei <[email protected]> Co-authored-by: 
Zhanghao Wu <[email protected]> * [k8s] LRU cache for GKE can_create_new_instance_of_type (#4973) * LRU cache for can_create_new_instance_of_type * request scope * [Test]Refactor backward compatibility test (#4906) * backward compat * fix * backcompat update * generate pipeline * bug fix * remove deactivate * robust backcompact test * fix * more log * bug fix * subprocess run with bash * bug fix * update template * fix flaky * limit concurrency * pip install uv * fix * low resource * fix * bump python version to 3.10 * recreate env * import order * [Core] Independent storage check (#4977) * independent storage check * formatting * granular perms * _is_storage_cloud_enabled uses storage check * UX improvement * remove debug logs * fix local test * sky check no regression * no sky check regression, managed jobs work * api backwards compatibility * define globally minimal perms for gcp * review feedback * continue from except * [GCP] Don't require TPU support for serve:gcp if TPU support is not required (#4991) don't require tpu support for serve:gcp if tpu support is not required * followup to #4935 (#4989) * review comments * use .get() where it makes sense * [Serve] BugFix: `any_of` field order issue cause version bump to not work (#4978) * [Serve] BugFix: `any_of` field order issue cause version bump to not work * upd * [Example] Batch Inference (#4994) * initial code for batched inference * Refactor batch inference scripts and configuration files for improved consistency and clarity. Removed unused bucket name generation and monitoring service launch from `batch_compute_vectors.py`. Updated `compute_text_vectors.yaml` and `monitor_progress.yaml` to use a unified bucket name. Revised README to reflect changes in embedding generation focus and performance highlights. * formattting * Update batch inference configuration files for consistency. Renamed `compute_text_vectors` and `monitor_progress` to include `batch-inference` prefix. 
Revised README to enhance clarity on monitoring progress and accessing the service. * Update README for batch inference: replaced local image links with external URLs, corrected endpoint variable in monitoring instructions, and added a new image for enhanced visual representation. * Update README for batch inference: corrected image URLs to include file extensions and added a new section for further learning resources. * Enhance README for batch inference: updated the section on computing embeddings to include details about the Amazon reviews dataset and clarified the use of the `Alibaba-NLP/gte-Qwen2-7B-instruct` model for generating embeddings. * update banner --------- Signed-off-by: Aylei <[email protected]> Signed-off-by: Ajay-Satish-01 <[email protected]> Co-authored-by: Romil Bhardwaj <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: zpoint <[email protected]> Co-authored-by: Kaiyuan Eric Chen <[email protected]> Co-authored-by: Christopher Cooper <[email protected]> Co-authored-by: Aylei <[email protected]> Co-authored-by: chris mckenzie <[email protected]> Co-authored-by: Seung Jin <[email protected]> Co-authored-by: Ajay Satish <[email protected]> Co-authored-by: Daniel Shin <[email protected]> Co-authored-by: Tian Xia <[email protected]>
Fixes #4768
Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `conda deactivate; bash -i tests/backward_compatibility_tests.sh` (local)