For AI vLLM example, add more cloud provider specific information #568
Conversation
/assign @janetkuo
```yaml
# - GKE
# nodeSelector:
#   cloud.google.com/gke-accelerator: nvidia-l4
#   cloud.google.com/gke-gpu-driver-version: latest
```
Open question: Is it recommended to use default or latest?
Good catch. I am changing it to `default`. While the documentation below recommends `latest`, I believe `default` makes more sense in this case because it should be more stable.
Both `latest` and `default` can be correct, but they specify different driver installation behaviors. For most cases, `latest` is the recommended choice.

**Understanding the difference:** when you create a GPU node pool in GKE, you choose how the NVIDIA drivers are managed. The `nodeSelector` labels in your workload must match the configuration on the node.

🚀 `latest`: targets nodes where GKE automatically installs and updates to the latest stable driver version available for your GKE version. This is the best option for most users because it picks up recent performance improvements and security patches without manual intervention.

🛡️ `default`: targets nodes that use the default driver version for your GKE version. This version is more static and will not change automatically, providing a more stable target if your application has a strict dependency on a specific driver version. Use this to prevent unexpected driver updates from affecting your workload.
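As a concrete sketch, pinning the workload to the version-stable driver would look like the following. The Deployment name and image are hypothetical placeholders; only the two `cloud.google.com/...` labels come from this PR's diff:

```yaml
# Sketch: schedule vLLM pods onto GKE L4 nodes that run the
# "default" (version-stable) NVIDIA driver instead of "latest".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm                # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-gpu-driver-version: default
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # hypothetical image reference
        resources:
          limits:
            nvidia.com/gpu: "1"          # request one GPU
```

With `default`, the pods keep landing on nodes whose driver version only changes when the GKE version does, which is the stability trade-off described above.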
Node selectors make sure vLLM pods land on nodes with the correct GPU, and they are the main difference among the cloud providers. The following are `nodeSelector` examples for three cloud providers.

- GKE

  This `nodeSelector` uses labels that are specific to Google Kubernetes Engine.
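The per-provider differences can be sketched as alternative `nodeSelector` blocks. The GKE labels match this PR's diff; the EKS and AKS label keys and values are assumptions you would adapt to your own node groups or node pools:

```yaml
# GKE: GKE-managed accelerator labels (from this PR's diff).
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
  cloud.google.com/gke-gpu-driver-version: latest

# EKS (assumption): select GPU nodes by the standard
# instance-type label; replace g6.xlarge with your GPU
# instance type.
# nodeSelector:
#   node.kubernetes.io/instance-type: g6.xlarge

# AKS (assumption): select nodes by agent pool; replace
# "gpu" with the name of your GPU node pool.
# nodeSelector:
#   kubernetes.azure.com/agentpool: gpu
```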
This isn't rendered well (the newline isn't rendered)
I think this is fixed. Please let me know what you think.
ai/vllm-deployment/README.md (Outdated)
```markdown
---

## Cloud Provider Differences
```
There are platforms that aren't public clouds, such as on-prem vendors. Suggest making the title more inclusive, such as "Platform-Specific Configuration" or "GPU Node Selection on Different Platforms"
```diff
- ## Cloud Provider Differences
+ ## Platform-Specific Configuration
```
I have updated this (as well as the link). Please let me know what you think.
ai/vllm-deployment/README.md (Outdated)
```yaml
  cloud.google.com/gke-accelerator: nvidia-l4
  cloud.google.com/gke-gpu-driver-version: latest
```
- EKS
This is a valuable addition for making this example useful on other clouds. My one question is about long-term maintenance. Since our team's expertise is primarily with GKE, how can we ensure the configurations for other platforms stay up-to-date? Perhaps we could add a note welcoming community contributions to maintain them?
This is a good question that I don't have an answer to at the moment. In a separate PR, I could add a CONTRIBUTING.md, which lays out our expectations for maintenance from each cloud provider (and/or bare metal). Let's discuss this in person.
Force-pushed from df17639 to e78f5a1, then from e78f5a1 to a5b001f.
janetkuo left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: janetkuo, seans3. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
- Adds a `Platform-Specific Configuration` section with relevant information to README.md
- Adds `nodeSelector` examples for three cloud providers to vllm-deployment.yaml