
Conversation

DarkLight1337
Member

@DarkLight1337 DarkLight1337 commented Mar 23, 2025

  • Specifically ask users to set --max-num-seqs to avoid OOM in V1.
  • Mention additional options (VLLM_MM_INPUT_CACHE_GIB and VLLM_CPU_KVCACHE_SPACE) to reduce CPU memory consumption (see the sketch after this list).
  • Reduce the VLLM_MM_INPUT_CACHE_GIB default to 4 (previously 8), as users with 32 GB RAM may otherwise run out of memory.
  • Misc: Update the Engine Arguments page to point back to the offline inference and online serving pages for easy reference.
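
A minimal sketch of how these settings might be applied together (the env-var names are the ones documented in this PR; the model name and values are placeholders, not recommendations):

    import os

    # Set environment variables before the engine starts so vLLM picks them up.
    os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "4"  # cap the multimodal input cache (GiB)
    os.environ["VLLM_CPU_KVCACHE_SPACE"] = "4"   # CPU KV-cache space in GiB (CPU backend)

    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen2-VL-2B-Instruct",  # placeholder multimodal model
        max_num_seqs=256,                   # lower than the V1 default of 1024 to reduce peak memory
    )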

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 23, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 23, 2025
@DarkLight1337
Member Author

DarkLight1337 commented Mar 23, 2025

cc @robertgshaw2-redhat should we reduce the default max_num_seqs for V1, or adjust it automatically somehow? There have been numerous reports of OOM from people using lower-end GPUs like RTX3060.

@robertgshaw2-redhat
Collaborator

cc @robertgshaw2-redhat should we reduce the default max_num_seqs for V1, or adjust it automatically somehow? There have been numerous reports of OOM from people using lower-end GPUs like RTX3060.

Why does --max-num-seqs result in OOM?

@ywang96
Member

ywang96 commented Mar 24, 2025

cc @robertgshaw2-redhat should we reduce the default max_num_seqs for V1, or adjust it automatically somehow? There have been numerous reports of OOM from people using lower-end GPUs like RTX3060.

Why does --max-num-seqs result in OOM?

@robertgshaw2-redhat This is mostly related to two changes we made in V1:

  1. max-num-seqs was raised from 256 to 1024 by default.
  2. On V1, two PRs were added late in development ([Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler #13594, [V1][Core] Fix memory issue with logits & sampling #14508) to add a sampler dummy run to profile_run and compile_or_warm_up_model (the latter mitigates a memory fragmentation issue and was previously missing on V0). This means there will always be a dummy sampler run with the maximum number of possible decoding sequences (which is very often --max-num-seqs), and this can sometimes result in OOM because of the new default. We also added an explicit error about it here, but users can sometimes miss it.
    raise RuntimeError(
        "CUDA out of memory occurred when warming up sampler with "
        f"{num_reqs} dummy requests. Please try lowering "
        "`max_num_seqs` or `gpu_memory_utilization` when "
        "initializing the engine.") from e

@robertgshaw2-redhat
Collaborator

cc @robertgshaw2-redhat should we reduce the default max_num_seqs for V1, or adjust it automatically somehow? There have been numerous reports of OOM from people using lower-end GPUs like RTX3060.

Why does --max-num-seqs result in OOM?

@robertgshaw2-redhat This is mostly related to two changes we made in V1:

  1. max-num-seqs was raised from 256 to 1024 by default.
  2. On V1, two PRs were added late in development ([Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler #13594, [V1][Core] Fix memory issue with logits & sampling #14508) to add a sampler dummy run to profile_run and compile_or_warm_up_model (the latter mitigates a memory fragmentation issue and was previously missing on V0). This means there will always be a dummy sampler run with the maximum number of possible decoding sequences (which is very often --max-num-seqs), and this can sometimes result in OOM because of the new default. We also added an explicit error about it here, but users can sometimes miss it.
    raise RuntimeError(
        "CUDA out of memory occurred when warming up sampler with "
        f"{num_reqs} dummy requests. Please try lowering "
        "`max_num_seqs` or `gpu_memory_utilization` when "
        "initializing the engine.") from e

Okay, we can set --max-num-seqs 1024 just for H100 and A100 then. WDYT @WoosukKwon
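
For readers who hit the sampler warm-up error quoted above, a minimal sketch of the mitigation it suggests; the model name and values are placeholders chosen for illustration:

    from vllm import LLM

    # Lowering max_num_seqs shrinks the dummy sampler run performed during warm-up;
    # lowering gpu_memory_utilization leaves more headroom outside vLLM's allocation.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        max_num_seqs=256,             # the old V0 default, well below the V1 default of 1024
        gpu_memory_utilization=0.85,  # slightly below the 0.9 default
    )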

Collaborator

@houseroad houseroad left a comment


I feel it's safer to leave the default unchanged if it's already used in production.

Comment on lines 101 to 102
The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1.
If you encounter OOM only when using V1 engine, try setting a lower value of `max_num_seqs`.
Member


Suggested change
The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1.
If you encounter OOM only when using V1 engine, try setting a lower value of `max_num_seqs`.
The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1.
If you encounter CUDA OOM only when using V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.

Member Author


I am worried this might cause some confusion, since lowering gpu_memory_utilization may lead to a related error: "No available memory for the cache blocks".

Member


Let’s still clarify that this is related to CUDA OOM.

Member Author


Updated in 4c27e09

@ywang96
Member

ywang96 commented Mar 24, 2025

@simon-mo What are your thoughts on this one?

Comment on lines 100 to 104
:::{important}
The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1.
If you encounter CUDA OOM only when using V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.
On the other hand, if you get an error about insufficient memory for the cache blocks, you should increase `gpu_memory_utilization` as this indicates that your GPU has sufficient memory but you're not allocating enough of it to vLLM.
:::
Member


Now that I think about it, maybe we should move this to the V1 User Guide page?

Member


I moved this under FAQ for V1
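
To make the distinction in the note above concrete, an illustrative sketch (placeholder model; the values are arbitrary examples):

    from vllm import LLM

    # CUDA OOM at start-up (e.g. the sampler warm-up error): lower max_num_seqs
    # and/or gpu_memory_utilization.
    llm = LLM(model="facebook/opt-1.3b", max_num_seqs=256, gpu_memory_utilization=0.8)

    # "No available memory for the cache blocks": the GPU has spare memory but too
    # little of it is allocated to vLLM, so raise gpu_memory_utilization instead, e.g.:
    # llm = LLM(model="facebook/opt-1.3b", gpu_memory_utilization=0.95)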

@simon-mo simon-mo enabled auto-merge (squash) March 24, 2025 16:58
@simon-mo simon-mo modified the milestones: v0.8.0, v0.8.2 Mar 24, 2025
@simon-mo simon-mo disabled auto-merge March 24, 2025 21:29
@simon-mo simon-mo merged commit 6dd55af into vllm-project:main Mar 24, 2025
26 of 33 checks passed
@DarkLight1337 DarkLight1337 deleted the docs-oom branch March 25, 2025 03:58
erictang000 pushed a commit to erictang000/vllm that referenced this pull request Mar 25, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
wrmedford pushed a commit to wrmedford/vllm that referenced this pull request Mar 26, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Wes Medford <[email protected]>
@GohioAC

GohioAC commented Mar 27, 2025

The OOM issue is not limited to the dummy sampler run with the maximum number of possible decoding sequences.

I'm running some 1B multimodal models with max-num-seqs set to 32, and the CPU RAM usage increases after every batch. I have tried deleting previous batch inputs and garbage collecting, to no avail. Even setting disable_mm_preprocessor_cache to True does not help.

Let me know if I should create a dedicated issue with more details. This issue is specific to the V1 engine.
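
For context, a hedged sketch of the configuration described above (the model name is a placeholder, and disable_mm_preprocessor_cache is assumed to be accepted as an engine argument in the version in use):

    from vllm import LLM

    llm = LLM(
        model="HuggingFaceTB/SmolVLM-Instruct",  # placeholder small multimodal model
        max_num_seqs=32,
        disable_mm_preprocessor_cache=True,      # bypass the multimodal preprocessor cache
    )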

@DarkLight1337
Member Author

Which version of vLLM are you using? Both v0.8.1 and v0.8.2 fixed some memory leaks.
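
A quick way to check the installed version (only the package's standard __version__ attribute is assumed):

    import vllm
    print(vllm.__version__)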

@GohioAC

GohioAC commented Mar 27, 2025

Just saw #15294. Looks like the same issue.
My bad for not searching thoroughly. I'll give v0.8.2 a whirl.

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
