Skip to content

Conversation

@jberkhahn
Copy link
Contributor

@jberkhahn jberkhahn commented Sep 9, 2025

Purpose

vllm-spyre has been experiencing timeouts on this function in certain scenarios when forcing model compilation to happen serially with multiple backends and large context lengths. This function already has an option to set a timeout, it's just not passed here. This PR makes it configurable, but leaves it as the default (which is 30 minutes).

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a configurable timeout for init_distributed_environment, which is a useful addition for scenarios requiring longer initialization times. However, the current implementation has a potential issue where passing the default None value for the timeout to torch.distributed.init_process_group could lead to a runtime error. My review includes suggestions to fix this by conditionally passing the timeout argument, ensuring that the default PyTorch timeout is used when no specific timeout is provided.

Comment on lines 988 to 987
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The type hint for timeout should be Optional[timedelta] to correctly reflect that None is a possible value. This improves code clarity and helps static type checkers.

Suggested change
timeout: timedelta = None,
):
timeout: Optional[timedelta] = None,
):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional[timedelta] as already stated by gemini - otherwise lgtm

Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 9, 2025
@njhill njhill changed the title Make timeout passable in init_distributed_envionment [Misc] Make timeout passable in init_distributed_environment Sep 9, 2025
@jberkhahn jberkhahn force-pushed the timeout_fix branch 2 times, most recently from 1aea182 to b43d7bc Compare September 9, 2025 21:46
@jberkhahn jberkhahn closed this Sep 10, 2025
@jberkhahn jberkhahn reopened this Sep 10, 2025
@jberkhahn jberkhahn force-pushed the timeout_fix branch 2 times, most recently from abd5623 to e6d6c51 Compare September 10, 2025 20:27
@simon-mo simon-mo merged commit cc99baf into vllm-project:main Sep 10, 2025
38 of 42 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants