[Nebius] Nebius Object Storage support. #4838

Merged
merged 93 commits into from
Mar 20, 2025

Conversation

SalikovAlex
Contributor

@SalikovAlex SalikovAlex commented Feb 27, 2025

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    pytest tests/smoke_tests/test_mount_and_storage.py::test_nebius_storage_mounts --nebius
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests:
    pytest tests/smoke_tests/test_mount_and_storage.py::test_nebius_storage_mounts --nebius
    pytest tests/smoke_tests/test_mount_and_storage.py::TestStorageWithCredentials::test_externally_created_bucket_mount_without_source --nebius

@SalikovAlex SalikovAlex marked this pull request as draft February 27, 2025 14:19
@SalikovAlex SalikovAlex marked this pull request as ready for review February 28, 2025 19:47
@Michaelvll Michaelvll requested a review from cblmemo February 28, 2025 21:58
Collaborator

@cblmemo cblmemo left a comment

Thanks for adding this @SalikovAlex ! This is exciting. Left some discussions ;)

Collaborator

@cblmemo cblmemo left a comment

Thanks for the update @SalikovAlex ! The PR mostly looks good to me. Left some discussions - after that it should be ready to go!

Could you also help run the related smoke test on nebius?

# To increase parallelism, modify max_concurrent_requests in your
# aws config file (Default path: ~/.aws/config).
endpoint_url = nebius.create_endpoint()
if 'nebius://' in source:
Collaborator

If it starts with s3://, wouldn't it be treated as S3 storage and routed to the S3Store class?

@SalikovAlex
Contributor Author

Looks like you are right.
I am trying to figure out why, for R2, we are using:

if 'r2://' in source:
    source = source.replace('r2://', 's3://')

Collaborator

@cblmemo cblmemo left a comment

Thanks @SalikovAlex for fixing this! Left final comments. After addressing those, could you help run the related smoke test on your side?

# To increase parallelism, modify max_concurrent_requests in your
# aws config file (Default path: ~/.aws/config).
endpoint_url = nebius.create_endpoint()
if 'nebius://' in source:
Collaborator

I think here we also assume it can only start with `nebius://`
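
The concern in this thread (matching the scheme anywhere in the string versus only at the start) can be illustrated with a minimal sketch. This is not the PR's actual code: the helper names `to_s3_uri` and `build_sync_command` and the example endpoint URL are hypothetical. The only things assumed from the discussion are that `nebius://` sources are rewritten to `s3://` for the AWS CLI and that an endpoint URL is passed along.

```python
from typing import List

NEBIUS_SCHEME = 'nebius://'


def to_s3_uri(source: str) -> str:
    """Rewrite a nebius:// URI to s3:// (hypothetical helper).

    Only a leading scheme is rewritten, so a path that merely contains
    the text 'nebius://' somewhere in the middle is left untouched.
    """
    if source.startswith(NEBIUS_SCHEME):
        return 's3://' + source[len(NEBIUS_SCHEME):]
    return source


def build_sync_command(source: str, destination: str,
                       endpoint_url: str) -> List[str]:
    """Build an `aws s3 sync` argument list (hypothetical helper)."""
    return [
        'aws', 's3', 'sync',
        to_s3_uri(source), destination,
        '--endpoint-url', endpoint_url,
    ]


if __name__ == '__main__':
    # The endpoint below is a placeholder, not a real Nebius endpoint.
    print(build_sync_command('nebius://my-bucket/data', '/tmp/data',
                             'https://storage.example.invalid'))
```

Using `startswith` instead of `in` makes the assumption stated in the comment above explicit: the source is expected to begin with `nebius://`, and anything else passes through unchanged.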

@SalikovAlex
Contributor Author

SalikovAlex commented Mar 14, 2025

pytest tests/smoke_tests/test_mount_and_storage.py --nebius
All tests are green.

@cblmemo
Collaborator

cblmemo commented Mar 14, 2025

/quicktest-core

@cblmemo
Collaborator

cblmemo commented Mar 14, 2025

/smoke-test -k storage

Collaborator

@cblmemo cblmemo left a comment

Thanks for adding this @SalikovAlex ! This is amazing. It mostly looks great to me, besides the quicktest-core failure, though that does not seem to be caused by this PR. @zpoint can you help confirm whether the backward compatibility test is failing on master as well?

@zpoint
Collaborator

zpoint commented Mar 20, 2025

The pip issue is fixed in #4939. You can merge master and try again.

This commit introduces support for Nebius object storage, enabling users to integrate Nebius storage with various functionalities such as file mounting, syncing, and cloud transfers. It includes necessary adaptors, utility methods, and updates to the storage framework to handle Nebius-specific configurations and endpoints.
Implemented new Nebius storage mounting functionality, including COPY and MOUNT modes, with tests. Updated cloud_stores.py for AWS CLI compatibility and added a YAML template for Nebius storage configurations. Removed outdated Nebius test case in favor of the new approach.
This commit introduces comprehensive Nebius support, making it accessible for S3-compatible operations including bucket creation, deletion, and mounting. It removes reliance on the AWS_SHARED_CREDENTIALS_FILE environment variable, streamlines Nebius-specific configurations, and adds associated unit test parameters to validate functionality across storage operations.
Remove redundant code, streamline imports, and enhance error messaging. Adjust documentation for better accuracy and update function annotations. These changes improve maintainability and readability of the Nebius adaptor module.
Simplified Nebius AWS CLI installation by reusing the S3CloudStorage configuration for consistency. Removed unnecessary debug print in `run_upload_cli` to reduce console noise. Minor formatting adjustment in test YAML file.
Clean up and improve code readability, including string formatting and conditionals. Introduce `_TIMEOUT_TO_PROPAGATES` to handle timeout while verifying Nebius bucket deletions. Update comments to reflect the corrected usage on Nebius servers.
Removed unused variable from subprocess call to clean up code. Updated timeout error to include the bucket name for more detailed and helpful error reporting.
Updated the helper method to assign a default region when no region is specified, ensuring compatibility with Nebius Object Storage. This change avoids potential errors caused by missing region values.
Replace 'nebius://' with 's3://' in source paths to ensure compatibility with AWS CLI commands. This allows seamless integration of Nebius storage endpoints.
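
The commit notes above mention a timeout while verifying Nebius bucket deletions, with the bucket name included in the timeout error. A minimal sketch of that kind of polling loop might look like the following. It is an illustration under assumptions, not the PR's implementation: the function name `wait_for_bucket_deletion`, the `bucket_exists` callback, and the default timeout and interval values are all hypothetical.

```python
import time
from typing import Callable


def wait_for_bucket_deletion(bucket_name: str,
                             bucket_exists: Callable[[str], bool],
                             timeout_seconds: float = 20.0,
                             poll_interval_seconds: float = 1.0) -> None:
    """Poll until the bucket is gone or the timeout elapses.

    `bucket_exists` is a hypothetical callback that returns True while
    the bucket is still visible on the storage backend.
    """
    deadline = time.monotonic() + timeout_seconds
    while bucket_exists(bucket_name):
        if time.monotonic() >= deadline:
            # Include the bucket name so the error is actionable.
            raise TimeoutError(
                f'Timed out waiting for bucket {bucket_name!r} '
                'to be deleted.')
        time.sleep(poll_interval_seconds)
```

The deadline is computed with `time.monotonic()` so that wall-clock adjustments do not shorten or extend the wait.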
KeplerC and others added 25 commits March 20, 2025 09:36
* gemma3

* Update gemma3.yaml to specify exact versions for transformers and vllm installations; add readiness probe configuration in service section. Update README.md to correct command option from 'deepseek' to 'gemma-3'.

* Remove outdated command option from README.md for clarity.

* update readme for serving

* Update README.md to correct the number of nodes in the service launch command and update the link to the appropriate SkyPilot YAML file.
Update README and documentation to include Gemma 3 model and example. Added Gemma 3 to the news section in README.md and updated the models index in documentation.
…pilot-org#4934)

* Update volume mounting docs

* Update volume mounting docs

* Update volume mounting docs

* Add nested tabset, restructure optional steps

* Move volume mounting docs

* Update volume mounting docs

* Reorder

* casing

* Comments

* fix

* reduce links
* remove serve and backcompact

* ignore buildkite yaml file
* Update benefits for client-server

* update

* Update docs/source/reference/api-server/api-server.rst

Co-authored-by: Zongheng Yang <[email protected]>

---------

Co-authored-by: Zongheng Yang <[email protected]>
* fix flaky for test_cancel_launch_and_exec_async

* comma

* use generic_cloud

* new line format
…ypilot-org#4935)

* working codepath

* remove prints and an assert

* make into classes

* minor changes

* update codepath comment

* lint

* slight reformat

* review feedback

* autoscale_detecror -> autoscaler

* unnest regions_with_offering logic

* short circuit on unsupported autoscaler

* formalize context name validation, add exception handling for cluster info request

* account for TPUs

* code hardening

* remove AUTOSCALER_TO_LABEL_FORMATTER in favor of expanding AUTOSCALER_TYPE_TO_AUTOSCALER

* more debug logs, review feedbacks

* final review comments addressed
…org#4974)

* fix annotation "kubernetes.io/ingress.class" is deprecated

Signed-off-by: Ajay-Satish-01 <[email protected]>

* fix: ingress spec based on version

---------

Signed-off-by: Ajay-Satish-01 <[email protected]>
* [API server] attach setup of controllers

Signed-off-by: Aylei <[email protected]>

* lint

Signed-off-by: Aylei <[email protected]>

* Address review comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
…g#4963)

* add support for zsh

* fix for bashrc after testing
* [k8s] GKE support for TPU V6

* gke t6 support

* remove wrong check
…fix flacky of test_job_queue_with_docker (skypilot-org#4955)

* different param to different steps

* longer time to sleep
* [API server] cleanup executor on shutdown

Signed-off-by: Aylei <[email protected]>

* refine

Signed-off-by: Aylei <[email protected]>

* just raise impossible exceptions

Signed-off-by: Aylei <[email protected]>

* Update sky/utils/subprocess_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
…#4973)

* LRU cache for can_create_new_instance_of_type

* request scope
* backward compat

* fix

* backcompat update

* generate pipeline

* bug fix

* remove deactivate

* robust backcompact test

* fix

* more log

* bug fix

* subprocess run with bash

* bug fix

* update template

* fix flaky

* limit concurrency

* pip install uv

* fix

* low resource

* fix

* bump python version to 3.10

* recreate env

* import order
* independent storage check

* formatting

* granular perms

* _is_storage_cloud_enabled uses storage check

* UX improvement

* remove debug logs

* fix local test

* sky check no regression

* no sky check regression, managed jobs work

* api backwards compatibility

* define globally minimal perms for gcp

* review feedback

* continue from except
…equired (skypilot-org#4991)

don't require tpu support for serve:gcp if tpu support is not required
* review comments

* use .get() where it makes sense
…work (skypilot-org#4978)

* [Serve] BugFix: `any_of` field order issue cause version bump to not work

* upd
* initial code for batched inference

* Refactor batch inference scripts and configuration files for improved consistency and clarity. Removed unused bucket name generation and monitoring service launch from `batch_compute_vectors.py`. Updated `compute_text_vectors.yaml` and `monitor_progress.yaml` to use a unified bucket name. Revised README to reflect changes in embedding generation focus and performance highlights.

* formattting

* Update batch inference configuration files for consistency. Renamed `compute_text_vectors` and `monitor_progress` to include `batch-inference` prefix. Revised README to enhance clarity on monitoring progress and accessing the service.

* Update README for batch inference: replaced local image links with external URLs, corrected endpoint variable in monitoring instructions, and added a new image for enhanced visual representation.

* Update README for batch inference: corrected image URLs to include file extensions and added a new section for further learning resources.

* Enhance README for batch inference: updated the section on computing embeddings to include details about the Amazon reviews dataset and clarified the use of the `Alibaba-NLP/gte-Qwen2-7B-instruct` model for generating embeddings.

* update banner
@zpoint
Collaborator

zpoint commented Mar 20, 2025

/quicktest-core

@cblmemo
Collaborator

cblmemo commented Mar 20, 2025

Thanks @SalikovAlex for adding this! LGTM. Merging now :))

@cblmemo cblmemo merged commit f727408 into skypilot-org:master Mar 20, 2025
19 checks passed