Removes source compilation of nixl dependency #24874
Signed-off-by: bbartels <[email protected]>
@NickLucche @hmellor Apologies, messed up the commits in the previous PR. Please have a look here :)
No ciflow labels are configured for this repo.
Code Review
This pull request aims to simplify the installation of the nixl dependency by removing source compilation in favor of a pre-built package. While this is a good goal, the implementation has a critical issue in the Dockerfile where it attempts to install an Ubuntu 24.04 package on an Ubuntu 22.04 base image, which will likely break the build. The documentation has also been updated with installation instructions that are not portable and will fail for users on different systems. I've provided detailed feedback and suggestions to address these issues.
docker/Dockerfile
Outdated
RUN curl https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu24_04/x64/libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb --output libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
    && apt install ./libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
    && rm libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb
There's a critical issue with the installation of libgdrapi. The Docker image is based on Ubuntu 22.04 (from FINAL_BASE_IMAGE=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04), but the .deb package being downloaded is for Ubuntu 24.04. This version mismatch will likely cause the build to fail or lead to runtime issues due to dependency conflicts.
Additionally, the CUDA version (12.8) is hardcoded in the URL, which makes it brittle if the CUDA_VERSION build argument changes.
NVIDIA does not seem to provide a pre-built libgdrapi package for CUDA 12.8 on Ubuntu 22.04. To resolve this, you could consider one of the following:
- Change the FINAL_BASE_IMAGE to an Ubuntu 24.04-based image, like nvidia/cuda:${CUDA_VERSION}-devel-ubuntu24.04, and verify that all other dependencies are compatible.
- Revert to compiling gdrcopy from source for Ubuntu 22.04, even though the goal of this PR is to remove source compilation.
- Make the URL and package name dynamic based on the OS and CUDA version, and add logic to handle cases where a pre-built package is not available.
Given the options, switching the base image to Ubuntu 24.04 might be the most straightforward solution if it doesn't introduce other compatibility problems.
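To illustrate the third option above, here is a minimal sketch of building the download URL from its parts instead of hardcoding it. The function name `gdrcopy_deb_url` is hypothetical, and the URL layout is inferred only from the single URL quoted in this review; other OS, CUDA, or architecture combinations are an assumption and may not exist on NVIDIA's server.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): prints the libgdrapi .deb URL.
# The layout is inferred from the one URL quoted in this review; it is an
# assumption that other OS/CUDA combinations follow the same pattern.
gdrcopy_deb_url() {
  local os_version="$1"              # e.g. Ubuntu24_04, as spelled in the package name
  local cuda_version="$2"            # e.g. 12.8
  local uarch="$3"                   # e.g. x64
  local pkg_version="${4:-2.5.1-1}"  # version seen in the quoted URL

  # Map the directory-style architecture to the Debian architecture name.
  local deb_arch
  case "$uarch" in
    x64) deb_arch="amd64" ;;
    *) echo "unsupported architecture: $uarch" >&2; return 1 ;;
  esac

  # The directory component is the lowercase form of the OS token.
  local os_dir
  os_dir="$(printf '%s' "$os_version" | tr '[:upper:]' '[:lower:]')"

  # %%20 emits the literal URL-encoded space in "CUDA 12.8".
  printf 'https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%%20%s/%s/%s/libgdrapi_%s_%s.%s.deb\n' \
    "$cuda_version" "$os_dir" "$uarch" "$pkg_version" "$deb_arch" "$os_version"
}
```

Calling `gdrcopy_deb_url Ubuntu24_04 12.8 x64` reproduces the URL in the diff above. A caller would still need to handle a failed download when no pre-built package exists for the requested combination.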
RUN curl https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu24_04/x64/libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb --output libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
    && apt install ./libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
    && rm libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb
The provided installation command for the gdrcopy dependency is highly specific to a single environment (Ubuntu 24.04 with CUDA 12.8) and will fail for most users. This can be very frustrating.
The documentation should provide more general guidance. For example, you could instruct users to install libgdrapi and link to the official gdrcopy installation page, while providing the specific command as an example for a particular setup.
Also, using RUN suggests a Docker command, which might be confusing for users setting up a host environment. It would be clearer to use commands appropriate for a user's shell (e.g., with sudo).
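As one way to make the host instructions less environment-specific, the sketch below derives the OS token used in the download path from the host's os-release file. The function name `gdrcopy_os_token` is hypothetical, and the assumption that the token is always the distribution ID followed by the version with dots replaced by underscores is based only on the `ubuntu24_04` example in the quoted URL.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): prints an OS token such as
# ubuntu22_04 from an os-release file. The token format is an assumption
# inferred from the "ubuntu24_04" path segment in the quoted URL.
gdrcopy_os_token() {
  local os_release="${1:-/etc/os-release}"

  # `.` searches PATH for names without a slash, so force a relative path.
  case "$os_release" in
    */*) ;;
    *) os_release="./$os_release" ;;
  esac

  (
    # Source in a subshell so ID and VERSION_ID do not leak into the caller.
    . "$os_release" || exit 1
    printf '%s%s\n' "$ID" "$(printf '%s' "$VERSION_ID" | tr '.' '_')"
  )
}
```

On a host, the resulting token could be checked against the directory listing on the NVIDIA download page before running the install with `sudo`, rather than assuming a package exists.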
RUN curl https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu24_04/x64/libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb --output libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
    && apt install ./libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
    && rm libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb
Similar to the previous comment, this installation command for gdrcopy is not portable and is specific to Ubuntu 24.04 with CUDA 12.8. This will cause issues for users with different setups. Please provide more general installation guidance, for instance by pointing to the official gdrcopy documentation and using this command as a specific example.
Gemini's comments seem valid
Signed-off-by: bbartels <[email protected]>
Signed-off-by: bbartels <[email protected]>
Added the correct OS version and uarch to the Dockerfile
Signed-off-by: bbartels <[email protected]>
Signed-off-by: bbartels <[email protected]>
Signed-off-by: bbartels <[email protected]>
Signed-off-by: Benjamin Bartels <[email protected]>
Signed-off-by: bbartels <[email protected]>
Signed-off-by: Benjamin Bartels <[email protected]>
@NickLucche @hmellor Any further comments? Seems to have all passed now!
NickLucche
left a comment
1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels).
2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
3. **For disaggregated serving**: Install UCX and NIXL following the [script](gh-file:tools/install_nixl.sh).
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
you're right thanks!
### Setup Steps

1. **Install KV Connector**: Install NIXL using the [installation script](gh-file:tools/install_nixl.sh)
1. **Install gdrcopy/ucx/nixl**: Run the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). `nixl` and `ucx` are installed as dependencies via pip.
we should mention gdrcopy is for maximum performance, this should work even without it (plain pip install nixl)
Amended :)
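Since gdrcopy is described above as optional (for maximum performance, with a plain `pip install nixl` still working), a setup script could report whether the library is visible before deciding to install it. The following is a rough sketch under that assumption; the function names are hypothetical, and matching on `libgdrapi.so` in the `ldconfig -p` listing is one possible check, not something the PR itself does.

```shell
#!/bin/sh
# Hypothetical check (not part of the PR): succeeds if a shared-library
# listing on stdin, such as the output of `ldconfig -p`, mentions libgdrapi.
has_libgdrapi() {
  grep -q 'libgdrapi\.so'
}

# Prints a short status line; gdrcopy is treated as optional here because
# the review states NIXL works without it after a plain `pip install nixl`.
report_gdrcopy() {
  if command -v ldconfig >/dev/null 2>&1 && ldconfig -p | has_libgdrapi; then
    echo "gdrcopy: libgdrapi found"
  else
    echo "gdrcopy: libgdrapi not found (optional, needed only for maximum performance)"
  fi
}

report_gdrcopy
```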
Signed-off-by: Benjamin Bartels <[email protected]>
Signed-off-by: bbartels <[email protected]> Signed-off-by: Benjamin Bartels <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Daniele <[email protected]>
Signed-off-by: bbartels <[email protected]> Signed-off-by: Benjamin Bartels <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Daniele <[email protected]> Signed-off-by: charlifu <[email protected]>
Signed-off-by: bbartels <[email protected]> Signed-off-by: Benjamin Bartels <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Daniele <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>