
Conversation

esmeetu (Member) commented Feb 26, 2024

Currently, multi-node vLLM inference is broken after #2811 introduced CuPy.
This PR supports multi-node inference in eager mode.
It likely serves as a temporary fix for #2826 and #2959 when running in eager mode.
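
Roughly, the idea is to skip the CuPy-based NCCL setup whenever CUDA graphs are not used, so eager-mode multi-node runs only depend on the regular torch.distributed NCCL group. Below is a minimal sketch of that control flow, not the actual vLLM code: `init_communication` is a hypothetical helper, and only the `enforce_eager` flag mirrors an existing vLLM option.

```python
import torch.distributed as dist


def init_communication(world_size: int, rank: int, enforce_eager: bool) -> None:
    """Hypothetical sketch: bring up distributed communication for one worker."""
    # The regular torch.distributed NCCL process group is all that eager mode
    # needs, and it already works across multiple nodes.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )

    if not enforce_eager:
        # The CuPy NCCL communicator is only a workaround for CUDA graph
        # capture (introduced in #2811), so it is skipped in eager mode.
        ...  # CuPy NCCL setup would go here
```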

esmeetu requested a review from WoosukKwon on February 26, 2024 at 13:58.
WoosukKwon (Collaborator) commented:

I think the right way to do this is to correctly set up CuPy in the multi-node setting? WDYT?

Yard1 (Collaborator) commented Feb 26, 2024

Considering CuPy is supposed to just be a workaround for now, I think it makes more sense to avoid using it unless necessary (and it should not be necessary in eager mode).

That being said, it does have to work in a multi-node setting to enable CUDA graphs there. It should be relatively straightforward to set it up.

esmeetu (Member, Author) commented Feb 27, 2024

> I think the right way to do this is to correctly set up CuPy in the multi-node setting? WDYT?

@WoosukKwon My understanding is that #2811 introduced CuPy only to fix the memory leak issue with CUDA graphs. Eager mode doesn't have that issue, so we can keep it as it was before, unless we find some benefit.
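
For context, a sketch of why eager mode can do without CuPy: the all-reduce can go straight through torch.distributed, and the CuPy path only matters when the call has to be captured inside a CUDA graph. This is a hypothetical wrapper with made-up names, assuming a standard NCCL process group has already been initialized; it is not vLLM's actual communication code.

```python
import torch
import torch.distributed as dist


def all_reduce_sum(x: torch.Tensor, use_cuda_graphs: bool) -> torch.Tensor:
    """Hypothetical wrapper around a tensor-parallel all-reduce."""
    if use_cuda_graphs:
        # Capturing the all-reduce inside a CUDA graph is the case that hit
        # the memory leak #2811 worked around with CuPy, so the CuPy-based
        # communicator would be used on this branch.
        raise NotImplementedError("CuPy-based path, only needed for CUDA graphs")
    # Eager mode: the plain NCCL all-reduce works fine, including multi-node.
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    return x
```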

WoosukKwon (Collaborator) commented:

@esmeetu Yep. Makes sense! Let's merge this PR and figure out how to correctly set up CuPy in the multi-node setting.

WoosukKwon merged commit c1c0d00 into vllm-project:main on Feb 27, 2024.
esmeetu deleted the fix-eager branch on March 1, 2024.
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Mar 4, 2024.
Successfully merging this pull request may close this issue: vLLM running on a Ray Cluster Hanging on Initializing.