[Doc] Update docs on handling OOM #15357
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
cc @robertgshaw2-redhat should we reduce the default?
Why does
@robertgshaw2-redhat This is mostly related to two changes we made on V1
Okay, we can set
I feel it's safer to leave the default unchanged, given it's already used in production.
> The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1.
> If you encounter OOM only when using V1 engine, try setting a lower value of `max_num_seqs`.
Suggested change:

```diff
 The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1.
-If you encounter OOM only when using V1 engine, try setting a lower value of `max_num_seqs`.
+If you encounter CUDA OOM only when using V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.
```
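For readers applying this advice, a minimal sketch of lowering `max_num_seqs` (the model name and value are illustrative, not from this PR):

```python
from vllm import LLM

# Capping max_num_seqs limits how many sequences are scheduled per batch,
# which lowers peak activation memory; 256 matches the old V0 default.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative model choice
    max_num_seqs=256,
)
```

The same knob is available as `--max-num-seqs` when launching a server.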
I am worried this might lead to some confusion, as lowering `gpu_memory_utilization` may lead to a related error: "No available memory for the cache blocks".
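Roughly why lowering `gpu_memory_utilization` can trigger that error (a back-of-the-envelope sketch with made-up numbers, not vLLM's actual accounting):

```python
# Hypothetical numbers: the KV cache only gets whatever remains of the
# utilization budget after model weights and peak activations are subtracted.
total_gpu_gib = 24.0    # a 24 GiB GPU (illustrative)
weights_gib = 15.0      # model weights (illustrative)
activations_gib = 3.0   # peak activations seen during profiling (illustrative)

for util in (0.90, 0.75):
    kv_cache_gib = total_gpu_gib * util - weights_gib - activations_gib
    print(f"gpu_memory_utilization={util}: {kv_cache_gib:.1f} GiB left for KV cache")

# gpu_memory_utilization=0.9:  3.6 GiB left -> fine
# gpu_memory_utilization=0.75: 0.0 GiB left -> "No available memory for the cache blocks"
```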
Let’s still clarify this is related to CUDA OOM
Updated in 4c27e09
@simon-mo What are your thoughts on this one?
:::{important}
The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1.
If you encounter CUDA OOM only when using V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.
On the other hand, if you get an error about insufficient memory for the cache blocks, you should increase `gpu_memory_utilization` as this indicates that your GPU has sufficient memory but you're not allocating enough of it to vLLM.
:::
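A hedged sketch of the two directions described in that note (model name and exact values are illustrative):

```python
from vllm import LLM

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative

# Case 1: CUDA OOM on V1 -> shrink the batch and/or the memory budget.
llm = LLM(model=MODEL, max_num_seqs=256, gpu_memory_utilization=0.8)

# Case 2: "No available memory for the cache blocks" -> give vLLM a larger
# share of the GPU so enough remains for KV cache after weights and activations.
llm = LLM(model=MODEL, gpu_memory_utilization=0.95)
```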
Now that I think about it, maybe we should move this to the V1 User Guide page?
I moved this under the FAQ for V1.
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Wes Medford <[email protected]>
The OOM issue is not limited to the dummy sampler run with the max number of possible decoding sequences. I'm running some 1B multimodal models with 32
Let me know if I should create a dedicated issue with more details. This issue is specific to the V1 engine.
Which version of vLLM are you using? Both v0.8.1 and v0.8.2 fixed some memory leaks.
Just saw #15294. Looks like the same issue.
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Mu Huai <[email protected]>
Related changes:
- `--max-num-seqs` to avoid OOM in V1
- `VLLM_MM_INPUT_CACHE_GIB` and `VLLM_CPU_KVCACHE_SPACE` to reduce CPU memory consumption
- `VLLM_MM_INPUT_CACHE_GIB` now defaults to 4 (previously 8), as users with 32GB RAM may otherwise run out of memory
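A sketch of how those environment variables might be applied (values are illustrative; both are read at startup, so set them before vLLM is imported or launched):

```python
import os

# Set before importing vllm, since both variables are read at startup.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "4"  # multimodal input cache size, in GiB
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"   # KV cache space for the CPU backend, in GiB

from vllm import LLM

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")  # illustrative multimodal model
```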