[Misc] Make timeout passable in init_distributed_environment #24522
Code Review
This pull request introduces a configurable timeout for init_distributed_environment, which is a useful addition for scenarios requiring longer initialization times. However, the current implementation has a potential issue where passing the default None value for the timeout to torch.distributed.init_process_group could lead to a runtime error. My review includes suggestions to fix this by conditionally passing the timeout argument, ensuring that the default PyTorch timeout is used when no specific timeout is provided.
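The conditional-passing suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `build_init_kwargs` is a hypothetical helper name, and whether an explicit `None` is actually rejected depends on the installed PyTorch version. The idea is simply to leave `timeout` out of the keyword arguments unless the caller supplied one, so that PyTorch's own default applies.

```python
from datetime import timedelta
from typing import Any, Dict, Optional


def build_init_kwargs(
    backend: str,
    init_method: str,
    world_size: int,
    rank: int,
    timeout: Optional[timedelta] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for
    torch.distributed.init_process_group, adding `timeout` only
    when the caller actually provided one."""
    kwargs: Dict[str, Any] = {
        "backend": backend,
        "init_method": init_method,
        "world_size": world_size,
        "rank": rank,
    }
    if timeout is not None:
        # Only override the PyTorch default when explicitly requested.
        kwargs["timeout"] = timeout
    return kwargs


# Intended use (not executed here, since it needs torch and a real rendezvous):
#   torch.distributed.init_process_group(**build_init_kwargs(...))
default_kwargs = build_init_kwargs("gloo", "tcp://127.0.0.1:29500", 1, 0)
long_kwargs = build_init_kwargs(
    "gloo", "tcp://127.0.0.1:29500", 1, 0, timeout=timedelta(hours=2)
)
```

With this shape, callers that never mention a timeout behave exactly as before, and only callers with slow initialization need to opt in to a longer value.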
vllm/distributed/parallel_state.py
Outdated
vllm/distributed/parallel_state.py
Outdated
Optional[timedelta] as already stated by gemini - otherwise lgtm
f39eb43 to 3048d38
njhill left a comment
LGTM
3048d38 to 6574d10
1aea182 to b43d7bc
b43d7bc to e680723
abd5623 to e6d6c51
Signed-off-by: jberkhahn <[email protected]>
e6d6c51 to 64a8cc7
…oject#24522) Signed-off-by: jberkhahn <[email protected]>
…oject#24522) Signed-off-by: jberkhahn <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Purpose
vllm-spyre has been hitting timeouts in this function in certain scenarios, namely when model compilation is forced to happen serially with multiple backends and large context lengths. The underlying torch.distributed.init_process_group call already accepts a timeout; it's just not passed here. This PR makes the timeout configurable but leaves it at the default (which is 30 minutes).