[V1][Core] Fix memory issue with logits & sampling #14508
Conversation
Signed-off-by: Roger Wang <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Roger Wang <[email protected]>
@varun-sundar-rabindranath @jeejeelee Please help take a look at why this is breaking the LoRA tests on V1 - thank you very much! 🙏
Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
# NOTE: In V1, the memory buffer for logits (max_num_reqs x vocab_size)
# is captured but cannot be released from PyTorch due to a known bug,
Could you please elaborate on this?
See the discussion here https://vllm-dev.slack.com/archives/C087WBWC5AQ/p1741398800083509?thread_ts=1741386694.452939&cid=C087WBWC5AQ - TL;DR is that `empty_cache` cannot be called when we turn on sleep mode.
Hmm... Why do we need `empty_cache`?
The difference here is that we never warmed up the sampler (in both V0 and V1), so the memory fragmentation issue was always there, just not as pronounced in V0 (since the default batch size is 256).
Now we're adding the sampler warmup in V1, but when we call `sleep()`, the memory buffer for logits can't be cleared from the PyTorch caching allocator (the bug mentioned in this comment), so the memory usage will be a lot higher.
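For readers following along, here is a minimal sketch of what the sampler warmup effectively does; this is not the actual vLLM implementation, and the function name and parameters are purely illustrative:

```python
import torch

def warmup_sampler_memory(max_num_reqs: int, vocab_size: int,
                          device: str = "cuda",
                          dtype: torch.dtype = torch.float32) -> None:
    # Allocate the worst-case logits buffer (max_num_reqs x vocab_size) once,
    # so the PyTorch caching allocator reserves a block big enough for it
    # up front rather than during serving.
    dummy_logits = torch.zeros(max_num_reqs, vocab_size,
                               device=device, dtype=dtype)
    # Stand-in for the real sampler forward pass; what matters here is that
    # the full-sized buffer is actually touched so the allocation is captured.
    probs = torch.softmax(dummy_logits, dim=-1)
    _ = torch.argmax(probs, dim=-1)
    # Intentionally no torch.cuda.empty_cache() afterwards: keeping the block
    # cached is the point, so later real batches reuse it instead of
    # fragmenting memory.
```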
@ywang96 Thanks for the explanation. Just want to double check: we don't want to call `empty_cache` anyway, because we intentionally reserve the `(max_num_reqs x vocab_size)`-sized tensor in the PyTorch allocator, right?
That is correct, though I do think there should be a better & cleaner fix for this to work with sleep mode in the long term. We should probably free the memory when `sleep` is called, then warm up the sampler again within `wakeup`, but this is currently blocked since we can't free the memory anyway.
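As a hedged illustration of the reserve-vs-free behavior being discussed here (plain PyTorch, not vLLM code; the shapes are made up):

```python
import torch

max_num_reqs, vocab_size = 256, 32_000   # illustrative sizes only

logits = torch.zeros(max_num_reqs, vocab_size, device="cuda")
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

del logits
# The tensor is gone and memory_allocated() drops, but memory_reserved()
# stays the same: the caching allocator keeps the block for reuse.
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

# Only empty_cache() returns the block to the driver - which, per the
# discussion above, is exactly what cannot be done safely with sleep mode on.
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```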
Hmm... How is the logits tensor different from other intermediate activation tensors?
I don't understand why this specific tensor becomes a problem.
Because `dummy_run` doesn't include/activate the sampler tensors; this is why we made `dummy_sampler_run` in the first place.
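A brief sketch of that distinction (names and shapes are illustrative, not the actual runner code): the model-forward dummy run only materializes hidden states, so the much wider logits buffer never gets allocated unless the sampler is warmed up separately.

```python
import torch

def dummy_model_run(max_num_tokens: int, hidden_size: int) -> torch.Tensor:
    # Model-forward warmup: the largest activation it produces is roughly
    # (max_num_tokens, hidden_size).
    return torch.zeros(max_num_tokens, hidden_size, device="cuda")

def dummy_sampler_run(max_num_reqs: int, vocab_size: int) -> torch.Tensor:
    # Sampler warmup: the logits buffer is (max_num_reqs, vocab_size), and
    # vocab_size can exceed 100k for modern tokenizers - a shape the
    # model-forward dummy run never creates, hence the separate warmup pass.
    return torch.zeros(max_num_reqs, vocab_size, device="cuda")
```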
Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Mu Huai <[email protected]>
Reopened from the reverted #13776.
Co-authored by @varun-sundar-rabindranath for the LoRA dummy-run fix.