[Nebius] Nebius Object Storage support. #4838
Thanks for adding this, @SalikovAlex! This is exciting. Left some discussions ;)
Thanks for the update, @SalikovAlex! The PR mostly looks good to me. Left some discussions; after those are addressed it should be ready to go!
Could you also help run the related smoke test on Nebius?
sky/cloud_stores.py
Outdated
# To increase parallelism, modify max_concurrent_requests in your
# aws config file (Default path: ~/.aws/config).
endpoint_url = nebius.create_endpoint()
if 'nebius://' in source:
If the source starts with s3://, won't it be treated as S3 storage and routed to the S3Store class?
Looks like you are right.
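To make the routing concern above concrete, here is a minimal sketch of dispatching on the URL scheme. It is an illustration only: the `_SCHEME_TO_STORE` table, the `pick_store` helper, and the `NebiusStore` name are assumptions made for this example and are not taken from the PR. The sketch uses `str.startswith` instead of a substring check so that a path that merely contains `nebius://` somewhere in the middle is not matched.

```python
from typing import Optional

# Hypothetical prefix-to-store table, for illustration only.
_SCHEME_TO_STORE = {
    's3://': 'S3Store',
    'nebius://': 'NebiusStore',
}


def pick_store(source: str) -> Optional[str]:
    """Return the store name whose scheme prefixes `source`, else None."""
    for prefix, store in _SCHEME_TO_STORE.items():
        # Match only at the start of the string, not anywhere inside it.
        if source.startswith(prefix):
            return store
    return None
```

Under this scheme a source written as `s3://...` is always routed to the S3 entry, which is why a Nebius source needs its own prefix at the routing stage.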
Thanks @SalikovAlex for fixing this! Left final comments. After addressing those, could you help run the related smoke test on your side?
/quicktest-core
/smoke-test -k storage
Thanks for adding this, @SalikovAlex! This is amazing. It mostly looks great to me, apart from the failing quicktest-core, though that does not seem to be caused by this PR. @zpoint can you help confirm whether the backward compatibility test is failing on master as well?
The pip issue is fixed in #4939. You can merge master and try again.
This commit introduces support for Nebius object storage, enabling users to integrate Nebius storage with various functionalities such as file mounting, syncing, and cloud transfers. It includes necessary adaptors, utility methods, and updates to the storage framework to handle Nebius-specific configurations and endpoints.
Implemented new Nebius storage mounting functionality, including COPY and MOUNT modes, with tests. Updated cloud_stores.py for AWS CLI compatibility and added a YAML template for Nebius storage configurations. Removed outdated Nebius test case in favor of the new approach.
This commit introduces comprehensive Nebius support, making it accessible for S3-compatible operations including bucket creation, deletion, and mounting. It removes reliance on the AWS_SHARED_CREDENTIALS_FILE environment variable, streamlines Nebius-specific configurations, and adds associated unit test parameters to validate functionality across storage operations.
Remove redundant code, streamline imports, and enhance error messaging. Adjust documentation for better accuracy and update function annotations. These changes improve maintainability and readability of the Nebius adaptor module.
Simplified Nebius AWS CLI installation by reusing the S3CloudStorage configuration for consistency. Removed unnecessary debug print in `run_upload_cli` to reduce console noise. Minor formatting adjustment in test YAML file.
Clean up and improve code readability, including string formatting and conditionals. Introduce `_TIMEOUT_TO_PROPAGATES` to handle timeout while verifying Nebius bucket deletions. Update comments to reflect the corrected usage on Nebius servers.
Removed unused variable from subprocess call to clean up code. Updated timeout error to include the bucket name for more detailed and helpful error reporting.
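The timeout handling described in the two commits above can be sketched as a polling loop. This is a minimal sketch under stated assumptions, not the PR's code: `wait_until_deleted`, the `bucket_exists` callback, and the default timeout and polling interval are hypothetical names and values chosen for illustration. Only the idea of waiting for a deletion to propagate, and of naming the bucket in the timeout error, comes from the commit messages.

```python
import time
from typing import Callable


def wait_until_deleted(bucket_name: str,
                       bucket_exists: Callable[[str], bool],
                       timeout_seconds: float = 20.0,
                       poll_interval: float = 0.5) -> None:
    """Block until `bucket_exists(bucket_name)` is False or time runs out.

    `bucket_exists` is a caller-supplied check (hypothetical here); the
    defaults are illustrative, not values used by the PR.
    """
    deadline = time.monotonic() + timeout_seconds
    while bucket_exists(bucket_name):
        if time.monotonic() >= deadline:
            # Include the bucket name so the error is actionable.
            raise TimeoutError(
                f'Timed out waiting for bucket {bucket_name!r} '
                'to be deleted.')
        time.sleep(poll_interval)
```

Passing the existence check in as a callback keeps the loop independent of any particular storage client, which also makes it easy to exercise with a fake in tests.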
Updated the helper method to assign a default region when no region is specified, ensuring compatibility with Nebius Object Storage. This change avoids potential errors caused by missing region values.
Replace 'nebius://' with 's3://' in source paths to ensure compatibility with AWS CLI commands. This allows seamless integration of Nebius storage endpoints.
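As a rough illustration of the rewrite described in the commit above, the sketch below swaps a leading `nebius://` for `s3://` and builds an `aws s3 sync` argument list that points the AWS CLI at a custom endpoint through its `--endpoint-url` option. The helper name `build_sync_command` is an assumption made for this example, and the endpoint in the usage note is a placeholder; the PR obtains its endpoint from `nebius.create_endpoint()`.

```python
from typing import List

_NEBIUS_PREFIX = 'nebius://'


def build_sync_command(source: str, destination: str,
                       endpoint_url: str) -> List[str]:
    """Build an `aws s3 sync` command for a Nebius or S3 source path.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    if source.startswith(_NEBIUS_PREFIX):
        # The AWS CLI only understands s3:// URLs, so swap the scheme.
        source = 's3://' + source[len(_NEBIUS_PREFIX):]
    return [
        'aws', 's3', 'sync', source, destination,
        '--endpoint-url', endpoint_url,
    ]
```

For example, `build_sync_command('nebius://my-bucket/data', '/tmp/data', 'https://storage.example.invalid')` yields a command whose source is `s3://my-bucket/data`. Returning an argument list instead of a shell string avoids quoting problems if it is later passed to `subprocess.run`.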
* gemma3 * Update gemma3.yaml to specify exact versions for transformers and vllm installations; add readiness probe configuration in service section. Update README.md to correct command option from 'deepseek' to 'gemma-3'. * Remove outdated command option from README.md for clarity. * update readme for serving * Update README.md to correct the number of nodes in the service launch command and update the link to the appropriate SkyPilot YAML file.
Update README and documentation to include Gemma 3 model and example. Added Gemma 3 to the news section in README.md and updated the models index in documentation.
…pilot-org#4934) * Update volume mounting docs * Update volume mounting docs * Update volume mounting docs * Add nested tabset, restructure optional steps * Move volume mounting docs * Update volume mounting docs * Reorder * casing * Comments * fix * reduce links
* remove serve and backcompact * ignore buildkite yaml file
* Update benefits for client-server * update * Update docs/source/reference/api-server/api-server.rst Co-authored-by: Zongheng Yang <[email protected]> --------- Co-authored-by: Zongheng Yang <[email protected]>
* fix flaky for test_cancel_launch_and_exec_async * comma * use generic_cloud * new line format
Signed-off-by: Aylei <[email protected]>
…ypilot-org#4935) * working codepath * remove prints and an assert * make into classes * minor changes * update codepath comment * lint * slight reformat * review feedback * autoscale_detecror -> autoscaler * unnest regions_with_offering logic * short circuit on unsupported autoscaler * formalize context name validation, add exception handling for cluster info request * account for TPUs * code hardening * remove AUTOSCALER_TO_LABEL_FORMATTER in favor of expanding AUTOSCALER_TYPE_TO_AUTOSCALER * more debug logs, review feedbacks * final review comments addressed
…org#4974) * fix annotation "kubernetes.io/ingress.class" is deprecated Signed-off-by: Ajay-Satish-01 <[email protected]> * fix: ingress spec based on version --------- Signed-off-by: Ajay-Satish-01 <[email protected]>
* [API server] attach setup of controllers Signed-off-by: Aylei <[email protected]> * lint Signed-off-by: Aylei <[email protected]> * Address review comments Signed-off-by: Aylei <[email protected]> --------- Signed-off-by: Aylei <[email protected]>
…g#4963) * add support for zsh * fix for bashrc after testing
Add kubernetes key
* [k8s] GKE support for TPU V6 * gke t6 support * remove wrong check
…ilot-org#4965) * fix * longer timeout
…fix flacky of test_job_queue_with_docker (skypilot-org#4955) * different param to different steps * longer time to sleep
* [API server] cleanup executor on shutdown Signed-off-by: Aylei <[email protected]> * refine Signed-off-by: Aylei <[email protected]> * just raise impossible exceptions Signed-off-by: Aylei <[email protected]> * Update sky/utils/subprocess_utils.py Co-authored-by: Zhanghao Wu <[email protected]> --------- Signed-off-by: Aylei <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]>
…#4973) * LRU cache for can_create_new_instance_of_type * request scope
* backward compat * fix * backcompat update * generate pipeline * bug fix * remove deactivate * robust backcompact test * fix * more log * bug fix * subprocess run with bash * bug fix * update template * fix flaky * limit concurrency * pip install uv * fix * low resource * fix * bump python version to 3.10 * recreate env * import order
* independent storage check * formatting * granular perms * _is_storage_cloud_enabled uses storage check * UX improvement * remove debug logs * fix local test * sky check no regression * no sky check regression, managed jobs work * api backwards compatibility * define globally minimal perms for gcp * review feedback * continue from except
…equired (skypilot-org#4991) don't require tpu support for serve:gcp if tpu support is not required
* review comments * use .get() where it makes sense
…work (skypilot-org#4978) * [Serve] BugFix: `any_of` field order issue cause version bump to not work * upd
* initial code for batched inference * Refactor batch inference scripts and configuration files for improved consistency and clarity. Removed unused bucket name generation and monitoring service launch from `batch_compute_vectors.py`. Updated `compute_text_vectors.yaml` and `monitor_progress.yaml` to use a unified bucket name. Revised README to reflect changes in embedding generation focus and performance highlights. * formattting * Update batch inference configuration files for consistency. Renamed `compute_text_vectors` and `monitor_progress` to include `batch-inference` prefix. Revised README to enhance clarity on monitoring progress and accessing the service. * Update README for batch inference: replaced local image links with external URLs, corrected endpoint variable in monitoring instructions, and added a new image for enhanced visual representation. * Update README for batch inference: corrected image URLs to include file extensions and added a new section for further learning resources. * Enhance README for batch inference: updated the section on computing embeddings to include details about the Amazon reviews dataset and clarified the use of the `Alibaba-NLP/gte-Qwen2-7B-instruct` model for generating embeddings. * update banner
9454a41 to ed366c0
/quicktest-core
Thanks @SalikovAlex for adding this! LGTM. Merging now :))
Tested (run the relevant ones):
bash format.sh
pytest tests/smoke_tests/test_mount_and_storage.py::test_nebius_storage_mounts --nebius
pytest tests/test_smoke.py