[Core] Support ARM architecture #4835

Merged: @Michaelvll merged 28 commits into master from build-gpu-image-for-arm on Mar 1, 2025

@Michaelvll Michaelvll commented Feb 27, 2025

Mitigates #4793

Fixes #4601

This enables SkyPilot clusters to run on the ARM architecture with the default image.

ARM support is becoming more important because NVIDIA's GH200 and GB200 (offered by Lambda Cloud and GCP) come with ARM CPUs by default, but our Docker image does not support ARM CPUs well.

The original k8s GPU image does not support the ARM architecture:

sky launch --cloud lambda --gpus gh200 nvidia-smi --image-id docker:us-docker.pkg.dev/sky-dev-465/skypilotk8s/skypilot-gpu:latest -c test-lambda-arm-origin

Error:

latest: Pulling from sky-dev-465/skypilotk8s/skypilot-gpu
no matching manifest for linux/arm64/v8 in the manifest list entries

⨯ Failed to set up SkyPilot runtime on cluster.  View logs: sky api logs -l sky-2025-02-27-22-36-28-562705/provision.log
sky.exceptions.CommandError: Command docker pull us-docker.pkg.dev/sky-dev-465/skypilotk8s/skypilot-gpu:latest failed with return code 1.
Failed to run docker setup commands.
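
The "no matching manifest" error above means the image's manifest list has no entry for the requested platform. A minimal sketch of how one might check this before launching follows; the helper name `has_platform` is hypothetical, and the commented-out pipeline assumes `docker` and `jq` are installed and that the image is published as a multi-platform manifest list.

```shell
#!/usr/bin/env bash
# has_platform PLATFORM (hypothetical helper): reads one "os/arch" entry per
# line on stdin and succeeds if PLATFORM is among them. Variant suffixes such
# as "/v8" are ignored, so "linux/arm64/v8" matches "linux/arm64".
has_platform() {
  local want line
  want="$(printf '%s' "$1" | cut -d/ -f1,2)"
  while IFS= read -r line || [ -n "$line" ]; do
    if [ "$(printf '%s' "$line" | cut -d/ -f1,2)" = "$want" ]; then
      return 0
    fi
  done
  return 1
}

# Example with a hand-written platform list (not real registry output):
if printf 'linux/amd64\n' | has_platform linux/arm64/v8; then
  echo "arm64 supported"
else
  echo "no arm64 entry in the manifest list"
fi

# Against a real image (requires docker and jq; left commented out):
# docker manifest inspect "$IMAGE" \
#   | jq -r '.manifests[] | "\(.platform.os)/\(.platform.architecture)"' \
#   | has_platform linux/arm64
```

Keeping the parsing separate from the registry call lets the check be exercised without network access.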

TODO (future PRs):

  • Make SkyPilot runtime dependency installation compatible with ARM
  • Make bucket dependency compatible with ARM
  • The Kubernetes GPU image does not activate the base env by default, while the CPU image does
  • Make sure the API server helm chart image is ARM-compatible
  • Make sure sky local up --ips works for remote ARM machines
  • Create a SkyPilot custom ARM image on different clouds, e.g. sky launch -t c6g.large works if we specify --image-id with an ARM-based deep learning image, but does not work with our default image

TODO before merging:

  • Make the new k8s image the latest tag.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • cd sky/clouds/service_catalogs/images/; ./skypilot-k8s-image.sh -p -g
    • cd sky/clouds/service_catalogs/images/; ./skypilot-k8s-image.sh -p
    • sky launch --cloud kubernetes --cpus 1 --image-id docker:us-docker.pkg.dev/sky-dev-465/skypilotk8s/skypilot-gpu:20250227 echo hi
    • sky autostop sky-b694-ubuntu --down -i 0
    • sky launch --cloud kubernetes --cpus 1 --image-id docker:us-docker.pkg.dev/sky-dev-465/skypilotk8s/skypilot:20250227 echo hi
    • sky autostop sky-b694-ubuntu --down -i 0
    • docker build . on Mac and AMD64 Linux
    • sky launch --cloud lambda --gpus Gh200 --image-id docker:us-docker.pkg.dev/sky-dev-465/skypilotk8s/skypilot-gpu:20250227 nvidia-smi -c test-lambda-arm, then SSH into the machine and confirm that uname -m shows aarch64, i.e. the ARM architecture
    • Launch on ARM arch:
      • sky launch --gpus Gh200 --cloud lambda examples/using_file_mounts.yaml -c test-fm-arm --down
      • Deploy local up on the machine: sky local up --ips
      • Deploy API server on the kubernetes cluster deployed above
      • sky launch --gpus gh200 --cloud kubernetes nvidia-smi
  • All smoke tests: pytest tests/test_smoke.py
    • pytest tests/test_smoke.py --kubernetes with the new images
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll marked this pull request as ready for review February 27, 2025 23:22
Dockerfile Outdated
# Install kubectl based on architecture
ARCH=$(uname -m) && \
if [ "$ARCH" = "x86_64" ]; then \
curl -LO "https://dl.k8s.io/release/v1.31.6/bin/linux/amd64/kubectl"; \
Collaborator

I'm wondering if this can be simplified as curl -LO "https://dl.k8s.io/release/v1.31.6/bin/linux/${TARGETARCH}/kubectl";

Collaborator Author

Good point! Simplified the implementation and made it more general.
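
For context on the thread above: `uname -m` reports `x86_64` or `aarch64`, while the kubectl download URLs (and Docker's `TARGETARCH` build argument, which BuildKit sets when it is declared with `ARG TARGETARCH`) use `amd64` and `arm64`. The sketch below shows that mapping in plain shell; the function names `to_docker_arch` and `kubectl_url` are hypothetical and are not taken from the PR's Dockerfile.

```shell
#!/usr/bin/env bash
# to_docker_arch MACHINE (hypothetical helper): map `uname -m` output to the
# architecture names used by TARGETARCH and by the kubectl download URLs.
to_docker_arch() {
  case "$1" in
    x86_64|amd64)  echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# kubectl_url VERSION ARCH (hypothetical helper): build the download URL
# for a Linux kubectl binary, following the URL pattern quoted in the review.
kubectl_url() {
  echo "https://dl.k8s.io/release/$1/bin/linux/$2/kubectl"
}

# Example: the URL an aarch64 machine would download from.
arch="$(to_docker_arch aarch64)"
kubectl_url v1.31.6 "$arch"
```

Inside a Dockerfile, using `TARGETARCH` directly removes the need for the mapping step, which is the simplification the reviewer suggested.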

@romilbhardwaj romilbhardwaj left a comment

Awesome @Michaelvll!

@Michaelvll

/smoke-test --kubernetes

@Michaelvll Michaelvll changed the title from "[Cloud] Make k8s image support ARM architecture" to "[Core] Support ARM architecture" on Feb 28, 2025
@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll! Super excited to make GH200s go brrrr

@aylei aylei left a comment

lgtm, thanks @Michaelvll !

@Michaelvll Michaelvll enabled auto-merge (squash) March 1, 2025 09:26
@Michaelvll Michaelvll merged commit cefc238 into master Mar 1, 2025
18 checks passed
@Michaelvll Michaelvll deleted the build-gpu-image-for-arm branch March 1, 2025 09:30
@cg505 cg505 mentioned this pull request Apr 14, 2025
Successfully merging this pull request may close these issues.

[Lambda] Remove local_ray dependency for lambda