Skip to content

Conversation

@bbartels
Copy link
Contributor

@bbartels bbartels commented Sep 15, 2025

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the documentation Improvements or additions to documentation label Sep 15, 2025
@bbartels
Copy link
Contributor Author

@NickLucche @hmellor Apologies, messed up the commits in the previous PR. Please have a look here :)

@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to simplify the installation of the nixl dependency by removing source compilation in favor of a pre-built package. While this is a good goal, the implementation has a critical issue in the Dockerfile where it attempts to install an Ubuntu 24.04 package on an Ubuntu 22.04 base image, which will likely break the build. The documentation has also been updated with installation instructions that are not portable and will fail for users on different systems. I've provided detailed feedback and suggestions to address these issues.

Comment on lines 443 to 445
RUN curl https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu24_04/x64/libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb --output libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
&& apt install ./libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
&& rm libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There's a critical issue with the installation of libgdrapi. The Docker image is based on Ubuntu 22.04 (from FINAL_BASE_IMAGE=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04), but the .deb package being downloaded is for Ubuntu 24.04. This version mismatch will likely cause the build to fail or lead to runtime issues due to dependency conflicts.

Additionally, the CUDA version (12.8) is hardcoded in the URL, which makes it brittle if the CUDA_VERSION build argument changes.

NVIDIA does not seem to provide a pre-built libgdrapi package for CUDA 12.8 on Ubuntu 22.04. To resolve this, you could consider one of the following:

  1. Change the FINAL_BASE_IMAGE to an Ubuntu 24.04-based image, like nvidia/cuda:${CUDA_VERSION}-devel-ubuntu24.04, and verify that all other dependencies are compatible.
  2. Revert to compiling gdrcopy from source for Ubuntu 22.04, even though the goal of this PR is to remove source compilation.
  3. Make the URL and package name dynamic based on the OS and CUDA version, and add logic to handle cases where a pre-built package is not available.

Given the options, switching the base image to Ubuntu 24.04 might be the most straightforward solution if it doesn't introduce other compatibility problems.

Comment on lines 17 to 19
RUN curl https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu24_04/x64/libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb --output libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
&& apt install ./libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
&& rm libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The provided installation command for the gdrcopy dependency is highly specific to a single environment (Ubuntu 24.04 with CUDA 12.8) and will fail for most users. This can be very frustrating.

The documentation should provide more general guidance. For example, you could instruct users to install libgdrapi and link to the official gdrcopy installation page, while providing the specific command as an example for a particular setup.

Also, using RUN suggests a Docker command, which might be confusing for users setting up a host environment. It would be clearer to use commands appropriate for a user's shell (e.g., with sudo).

Comment on lines 207 to 209
RUN curl https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu24_04/x64/libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb --output libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
&& apt install ./libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb \
&& rm libgdrapi_2.5.1-1_amd64.Ubuntu24_04.deb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to the previous comment, this installation command for gdrcopy is not portable and is specific to Ubuntu 24.04 with CUDA 12.8. This will cause issues for users with different setups. Please provide more general installation guidance, for instance by pointing to the official gdrcopy documentation and using this command as a specific example.

@hmellor
Copy link
Member

hmellor commented Sep 15, 2025

Gemini's comments seem valid

Signed-off-by: bbartels <[email protected]>
@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

Signed-off-by: bbartels <[email protected]>
@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@bbartels
Copy link
Contributor Author

Gemini's comments seem valid

Added correct version of OS/uarch to dockefile

@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

Signed-off-by: bbartels <[email protected]>
@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

Signed-off-by: bbartels <[email protected]>
@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@bbartels bbartels requested a review from NickLucche September 15, 2025 22:35
@bbartels
Copy link
Contributor Author

@NickLucche @hmellor Any further comments? Seems to have all passed now!

Copy link
Collaborator

@NickLucche NickLucche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for your work @bbartels @cjackal.
Just left one note on the docs about GDRCopy being optional.

1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels).
2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
3. **For disaggregated serving**: Install UCX and NIXL following the [script](gh-file:tools/install_nixl.sh).
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right thanks!

### Setup Steps

1. **Install KV Connector**: Install NIXL using the [installation script](gh-file:tools/install_nixl.sh)
1. **Install gdrcopy/ucx/nixl**: Run the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). `nixl` and `ucx` are installed as dependencies via pip.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should mention gdrcopy is for maximum performance, this should work even without it (plain pip install nixl)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended :)

@hmellor hmellor enabled auto-merge (squash) September 16, 2025 10:54
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 16, 2025
@hmellor hmellor merged commit 64ad551 into vllm-project:main Sep 17, 2025
79 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: bbartels <[email protected]>
Signed-off-by: Benjamin Bartels <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniele <[email protected]>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: bbartels <[email protected]>
Signed-off-by: Benjamin Bartels <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniele <[email protected]>
Signed-off-by: charlifu <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: bbartels <[email protected]>
Signed-off-by: Benjamin Bartels <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniele <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: bbartels <[email protected]>
Signed-off-by: Benjamin Bartels <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniele <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: bbartels <[email protected]>
Signed-off-by: Benjamin Bartels <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniele <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build documentation Improvements or additions to documentation ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants