
Conversation

@sairampillai
Contributor

@sairampillai sairampillai commented Sep 26, 2025

[Bugfix] Improve GPU validation logging in Ray fallback scenarios

Adds early GPU count validation and clearer Ray placement error messages when tensor_parallel_size exceeds available GPUs to address poor logging and help users diagnose K8s deployment failures.

Related Issues

Fixes #25263

Purpose

Fixes poor logging when tensor_parallel_size exceeds available GPUs in Ray fallback scenarios.

When tensor_parallel_size is set higher than the available GPU count (e.g., tensor_parallel_size=4 with only 1 GPU), vLLM silently falls back to Ray executor without adequate warning. This causes confusing error messages in K8s deployments, where users see Ray placement group timeout errors without understanding the root cause.

Changes Made

  1. Early GPU validation in vllm/config/parallel.py: warn during backend selection when the tensor parallel size exceeds the number of available GPUs
  2. Enhanced Ray placement error messages in vllm/executor/ray_utils.py: _wait_until_pg_ready() and initialize_ray_cluster() now explain the GPU resource mismatch (a hedged sketch of both changes follows below)
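
A rough illustration of the intent (not the exact diff: the helper name _warn_if_tp_exceeds_gpus and the hint constant are made up for this sketch, and it assumes cuda_device_count_stateless is importable from vllm.utils):

# (1) vllm/config/parallel.py -- warn early during config validation (sketch).
from vllm.logger import init_logger
from vllm.utils import cuda_device_count_stateless

logger = init_logger(__name__)

def _warn_if_tp_exceeds_gpus(tensor_parallel_size: int) -> None:
    # Hypothetical helper: compare the requested TP size against visible GPUs.
    gpu_count = cuda_device_count_stateless()
    if gpu_count and tensor_parallel_size > gpu_count:
        logger.warning(
            "Tensor parallel size (%d) exceeds available GPUs (%d); Ray "
            "placement may time out. Consider reducing tensor_parallel_size.",
            tensor_parallel_size, gpu_count)

# (2) vllm/executor/ray_utils.py -- when the placement group cannot be
# scheduled before the timeout, raise with actionable context instead of the
# bare timeout error.
PLACEMENT_GROUP_HINT = (
    "Tensor parallel size may exceed the GPUs available in your Ray cluster. "
    "Check resources with `ray status` and `ray list nodes`; on K8s, consider "
    "reducing --tensor-parallel-size to match available GPUs.")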

Files Modified

  • vllm/config/parallel.py - Added GPU count validation with clear warnings
  • vllm/executor/ray_utils.py - Enhanced Ray placement group error handling

Test Plan

Scenario Testing

  1. Single GPU scenario with multi-GPU tensor parallel: Test with --tensor-parallel-size 4 on a system with only 1 available GPU
  2. K8s GPU resource mismatch: Verify error messages in constrained K8s environments where pod requests only 1 GPU but tensor parallel size > 1
  3. Normal operation: Ensure no impact when GPU resources match tensor parallel requirements

Test Commands

# Test 1: Check warning when tensor_parallel_size > available GPUs
python -c "
import logging
logging.basicConfig(level=logging.WARNING)
from vllm.config.parallel import ParallelConfig
config = ParallelConfig(tensor_parallel_size=4)  # should warn if fewer than 4 GPUs are visible
print('Config test completed')
"

# Test 2: Ray integration test (requires multi-GPU setup)
PYTHONPATH=. python examples/offline_inference.py \
  --model microsoft/DialoGPT-small \
  --prompt "Hello world" \
  --tensor-parallel-size 2  # Will trigger validation if only 1 GPU

Functional Testing

  • Verify warning messages appear at correct configuration stages
  • Ensure normal operation remains unaffected with properly configured GPU resources
  • Test Ray cluster initialization warning when GPU mismatch detected

Test Result

Before Fix

  • No early warning when tensor_parallel_size exceeds available GPUs
  • Cryptic Ray placement group timeout errors:
    ValueError: Cannot provide a placement group of 'placement_group_specs=...' within 2550 seconds
    

After Fix

  • Early warning during configuration:
    WARNING: Tensor parallel size (4) exceeds available GPUs (1). This will likely cause issues. Consider reducing tensor_parallel_size to 1 or less...
    
  • Enhanced Ray placement error with actionable guidance:
    ValueError: Cannot provide a placement group requiring 4 GPUs (...) within 2550 seconds.
    Tensor parallel size may exceed available GPUs in your cluster. Check resources with `ray status` and `ray list nodes`.
    If running on K8s with limited GPUs, consider reducing --tensor-parallel-size to match available GPU resources.
    

Validation Results

  • Code quality checks passed: pre-commit hooks, format checks, lint checks
  • Backward compatibility preserved: No breaking changes to existing behavior
  • Enhanced user experience: Clear error messages guide users to resolution
  • K8s scenario targeted: Specific guidance for Kubernetes deployment issues

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


Signed-off-by: Sairam Pillai <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@robertgshaw2-redhat
Collaborator

Instead of improving the log, can we just not allow vLLM to start? I don't quite understand why falling back to Ray is needed at all.

@cjackal
Contributor

cjackal commented Sep 27, 2025

I also think this silent fallback behavior is not only confusing but also pretty dangerous, in the sense that a small typo in the model server maintainer's deployment configuration results in complete failure only after a long delay. And since the distribution logic and the user-facing deployment workflow are quite different, I think users are already well aware of which distribution backend they intend to use. While it wouldn't be a BC change, I'd +1 explicit declaration of the distribution backend.

(I'm not claiming this needs to be addressed in this PR; just echoing @robertgshaw2-redhat's comment above.)

@sairampillai
Contributor Author

I agree, @robertgshaw2-redhat @cjackal. Do you think we should close/merge this PR and then discuss the fallback scenario with a wider forum, or should I go ahead and create a new PR for explicit backend declaration and early stopping?

@jt-z

jt-z commented Oct 11, 2025

Hi @sairampillai, fantastic work tracking this bug down to the silent Ray fallback. Your detailed analysis in the PR description is a great example for the community.

I've been following the conversation and strongly agree with @robertgshaw2-redhat's suggestion to 'fail fast' by raising an error instead of just issuing a warning. This would prevent confusing timeouts and make the behavior much more robust, especially for users in constrained environments like K8s.

The implementation could be a direct change in vllm/config/parallel.py, something along these lines:

# In vllm/config/parallel.py, inside __post_init__.
# Module-level imports needed for the check:
from vllm.platforms import current_platform
from vllm.utils import cuda_device_count_stateless

# world_size = tensor_parallel_size * pipeline_parallel_size
if current_platform.is_cuda():
    gpu_count = cuda_device_count_stateless()
    if gpu_count < self.world_size:
        raise ValueError(
            f"Tensor parallel size ({self.world_size}) cannot be larger than "
            f"the number of available GPUs ({gpu_count})."
        )
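
With a check like this in __post_init__, the single-GPU repro from the test plan would fail immediately at configuration time instead of waiting out the Ray placement-group timeout.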

Let me know your thoughts. Happy to help in any way to get this important fix finalized and merged!

@hmellor
Member

hmellor commented Oct 13, 2025

Let's move forward with the fail fast approach

@sairampillai
Contributor Author

@hmellor got it! I will implement the fix and push the changes

@sairampillai
Contributor Author

@hmellor Updated per the discussion to fail fast. A quick way to exercise the new check locally is sketched below.
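
(A repro sketch only; it assumes a host with a single visible CUDA device, and the exception type and message text follow the snippet discussed above, so they may differ from the merged change.)

# test_tp_fail_fast.py -- hypothetical local check, run with `pytest`
import pytest
from vllm.config.parallel import ParallelConfig

def test_tp_size_larger_than_gpu_count_fails_fast():
    # On a machine with only one visible GPU, construction should now raise
    # immediately instead of silently falling back to the Ray executor.
    with pytest.raises(ValueError, match="Tensor parallel size"):
        ParallelConfig(tensor_parallel_size=4)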

@mergify

mergify bot commented Oct 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sairampillai.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 27, 2025
@hmellor
Member

hmellor commented Oct 28, 2025

LGTM, please fix the conflicts and we should be able to merge

Signed-off-by: Sairam Pillai <[email protected]>
@mergify mergify bot added the v1 label Oct 29, 2025
@sairampillai
Contributor Author

@hmellor Fixed conflicts and ready to merge

@mergify mergify bot removed the needs-rebase label Oct 29, 2025
@hmellor hmellor enabled auto-merge (squash) October 29, 2025 12:43
@github-actions github-actions bot added the ready label Oct 29, 2025
@hmellor hmellor merged commit 7437438 into vllm-project:main Oct 30, 2025
50 checks passed
MatthewBonanni pushed a commit to MatthewBonanni/vllm that referenced this pull request Oct 30, 2025
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Nov 12, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

ready, v1


Development

Successfully merging this pull request may close these issues.

[Bug]: Poor logging on not enough GPUs for vLLM pod
