
Conversation

@sairampillai
Contributor

@sairampillai sairampillai commented Sep 26, 2025

[Bugfix] Improve GPU validation logging in Ray fallback scenarios

Adds early GPU count validation and clearer Ray placement error messages when tensor_parallel_size exceeds available GPUs to address poor logging and help users diagnose K8s deployment failures.

Related Issues

Fixes #25263

Purpose

Fixes poor logging when tensor_parallel_size exceeds available GPUs in Ray fallback scenarios.

When tensor_parallel_size is set higher than the available GPU count (e.g., tensor_parallel_size=4 with only 1 GPU), vLLM silently falls back to Ray executor without adequate warning. This causes confusing error messages in K8s deployments, where users see Ray placement group timeout errors without understanding the root cause.

Changes Made

  1. Early GPU validation in vllm/config/parallel.py: warn during backend selection when the tensor parallel size exceeds the number of available GPUs
  2. Enhanced Ray placement error messages in vllm/executor/ray_utils.py: _wait_until_pg_ready() and initialize_ray_cluster() now explain the GPU resource mismatch (a hedged sketch of both changes follows below)
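
A rough illustration of the intent (not the exact diff: the helper name _warn_if_tp_exceeds_gpus and the hint constant are made up for this sketch, and it assumes cuda_device_count_stateless is importable from vllm.utils):

# (1) vllm/config/parallel.py -- warn early during config validation (sketch).
from vllm.logger import init_logger
from vllm.utils import cuda_device_count_stateless

logger = init_logger(__name__)

def _warn_if_tp_exceeds_gpus(tensor_parallel_size: int) -> None:
    # Hypothetical helper: compare the requested TP size against visible GPUs.
    gpu_count = cuda_device_count_stateless()
    if gpu_count and tensor_parallel_size > gpu_count:
        logger.warning(
            "Tensor parallel size (%d) exceeds available GPUs (%d); Ray "
            "placement may time out. Consider reducing tensor_parallel_size.",
            tensor_parallel_size, gpu_count)

# (2) vllm/executor/ray_utils.py -- when the placement group cannot be
# scheduled before the timeout, raise with actionable context instead of the
# bare timeout error.
PLACEMENT_GROUP_HINT = (
    "Tensor parallel size may exceed the GPUs available in your Ray cluster. "
    "Check resources with `ray status` and `ray list nodes`; on K8s, consider "
    "reducing --tensor-parallel-size to match available GPUs.")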

Files Modified

  • vllm/config/parallel.py - Added GPU count validation with clear warnings
  • vllm/executor/ray_utils.py - Enhanced Ray placement group error handling

Test Plan

Scenario Testing

  1. Single GPU scenario with multi-GPU tensor parallel: Test with --tensor-parallel-size 4 on a system with only 1 available GPU
  2. K8s GPU resource mismatch: Verify error messages in constrained K8s environments where pod requests only 1 GPU but tensor parallel size > 1
  3. Normal operation: Ensure no impact when GPU resources match tensor parallel requirements

Test Commands

# Test 1: Check warning when tensor_parallel_size > available GPUs
python -c "
import logging
logging.basicConfig(level=logging.WARNING)
from vllm.config.parallel import ParallelConfig
config = ParallelConfig(tensor_parallel_size=4)  # should warn if fewer than 4 GPUs are visible
print('Config test completed')
"

# Test 2: Ray integration test (requires multi-GPU setup)
PYTHONPATH=. python examples/offline_inference.py \
  --model microsoft/DialoGPT-small \
  --prompt "Hello world" \
  --tensor-parallel-size 2  # Will trigger validation if only 1 GPU

Functional Testing

  • Verify warning messages appear at correct configuration stages
  • Ensure normal operation remains unaffected with properly configured GPU resources
  • Test Ray cluster initialization warning when GPU mismatch detected

Test Result

Before Fix

  • No early warning when tensor_parallel_size exceeds available GPUs
  • Cryptic Ray placement group timeout errors:
    ValueError: Cannot provide a placement group of 'placement_group_specs=...' within 2550 seconds
    

After Fix

  • Early warning during configuration:
    WARNING: Tensor parallel size (4) exceeds available GPUs (1). This will likely cause issues. Consider reducing tensor_parallel_size to 1 or less...
    
  • Enhanced Ray placement error with actionable guidance:
    ValueError: Cannot provide a placement group requiring 4 GPUs (...) within 2550 seconds.
    Tensor parallel size may exceed available GPUs in your cluster. Check resources with `ray status` and `ray list nodes`.
    If running on K8s with limited GPUs, consider reducing --tensor-parallel-size to match available GPU resources.
    

Validation Results

  • Code quality checks passed: pre-commit hooks, format checks, lint checks
  • Backward compatibility preserved: No breaking changes to existing behavior
  • Enhanced user experience: Clear error messages guide users to resolution
  • K8s scenario targeted: Specific guidance for Kubernetes deployment issues

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


Signed-off-by: Sairam Pillai <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@robertgshaw2-redhat
Collaborator

Instead of improving the log, can we just not allow vLLM to start? I don't quite understand why falling back to Ray is needed at all.

@cjackal
Contributor

cjackal commented Sep 27, 2025

I also think this silent fallback behavior is not only confusing but also pretty dangerous, in the sense that a small typo in the model server maintainer's deployment configuration results in complete failure only after a long delay. And since the distribution logic and the user-facing deployment workflow are quite different, I think users are already well aware of which distribution backend they intend to use. While it wouldn't be a BC change, I'd +1 explicit declaration of the distribution backend.

(I'm not claiming this needs to be addressed in this PR; just echoing @robertgshaw2-redhat's comment above.)

@sairampillai
Contributor Author

I agree, @robertgshaw2-redhat @cjackal. Do you think we should close/merge this PR and then discuss the fallback scenario with a wider forum, or should I go ahead and create a new PR for explicit backend declaration and early stopping?

@jt-z

jt-z commented Oct 11, 2025

Hi @sairampillai, fantastic work tracking this bug down to the silent Ray fallback. Your detailed analysis in the PR description is a great example for the community.

I've been following the conversation and strongly agree with @robertgshaw2-redhat's suggestion to 'fail fast' by raising an error instead of just issuing a warning. This would prevent confusing timeouts and make the behavior much more robust, especially for users in constrained environments like K8s.

The implementation could be a direct change in vllm/config/parallel.py, something along these lines:

# In vllm/config/parallel.py, inside __post_init__.
# Module-level imports needed for the check:
from vllm.platforms import current_platform
from vllm.utils import cuda_device_count_stateless

# world_size = tensor_parallel_size * pipeline_parallel_size
if current_platform.is_cuda():
    gpu_count = cuda_device_count_stateless()
    if gpu_count < self.world_size:
        raise ValueError(
            f"Tensor parallel size ({self.world_size}) cannot be larger than "
            f"the number of available GPUs ({gpu_count})."
        )
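
With a check like this in __post_init__, the single-GPU repro from the test plan would fail immediately at configuration time instead of waiting out the Ray placement-group timeout.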

Let me know your thoughts. Happy to help in any way to get this important fix finalized and merged!

@hmellor
Member

hmellor commented Oct 13, 2025

Let's move forward with the fail fast approach

@sairampillai
Contributor Author

@hmellor got it! I will implement the fix and push the changes

@sairampillai
Contributor Author

@hmellor Updated per the discussion to fail fast. A quick way to exercise the new check locally is sketched below.
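
(A repro sketch only; it assumes a host with a single visible CUDA device, and the exception type and message text follow the snippet discussed above, so they may differ from the merged change.)

# test_tp_fail_fast.py -- hypothetical local check, run with `pytest`
import pytest
from vllm.config.parallel import ParallelConfig

def test_tp_size_larger_than_gpu_count_fails_fast():
    # On a machine with only one visible GPU, construction should now raise
    # immediately instead of silently falling back to the Ray executor.
    with pytest.raises(ValueError, match="Tensor parallel size"):
        ParallelConfig(tensor_parallel_size=4)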

@mergify

mergify bot commented Oct 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sairampillai.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 27, 2025
@hmellor
Member

hmellor commented Oct 28, 2025

LGTM, please fix the conflicts and we should be able to merge

Signed-off-by: Sairam Pillai <[email protected]>
@mergify mergify bot added the v1 label Oct 29, 2025
@sairampillai
Contributor Author

@hmellor Fixed conflicts and ready to merge

@mergify mergify bot removed the needs-rebase label Oct 29, 2025
@hmellor hmellor enabled auto-merge (squash) October 29, 2025 12:43
@github-actions github-actions bot added the ready label Oct 29, 2025
@hmellor hmellor merged commit 7437438 into vllm-project:main Oct 30, 2025
50 checks passed
MatthewBonanni pushed a commit to MatthewBonanni/vllm that referenced this pull request Oct 30, 2025
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Nov 12, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

ready, v1


Development

Successfully merging this pull request may close these issues.

[Bug]: Poor logging on not enough GPUs for vLLM pod
