
Conversation

Contributor

@celestialli celestialli commented Apr 14, 2025

What this PR does / why we need it?

This PR adds the sleep mode feature for vllm-ascend. When sleeping, we mainly do two things:

  • offload model weights
  • discard kv cache

RLHF tools (such as https://github.com/volcengine/verl and https://github.com/OpenRLHF/OpenRLHF) have a strong need for sleep mode to accelerate the training process.

This PR may solve #375 and #320 .

Does this PR introduce any user-facing change?

No existing user interfaces are changed.
Users gain two new methods, `sleep()` and `wake_up()`.
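A minimal sketch of how a user might drive the two new methods, assuming the vLLM `LLM`/`SamplingParams` API described in this PR; the `run_sleep_cycle` helper and the prompt text are illustrative, and actually running `main()` requires an Ascend NPU host with vllm-ascend installed:

```python
def run_sleep_cycle(llm, prompts, sampling_params):
    """Generate, sleep, wake up, then generate again with the same inputs."""
    before = [o.outputs[0].text for o in llm.generate(prompts, sampling_params)]
    llm.sleep(level=1)   # offload model weights, discard KV cache
    llm.wake_up()        # reload weights, reallocate KV cache
    after = [o.outputs[0].text for o in llm.generate(prompts, sampling_params)]
    return before, after

def main():
    # Call this on an Ascend NPU host; imports are deferred so the helper
    # above stays usable without vLLM installed.
    from vllm import LLM, SamplingParams
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    params = SamplingParams(temperature=0, max_tokens=10)
    before, after = run_sleep_cycle(llm, ["Hello, my name is"], params)
    assert before == after  # greedy decoding should match across a sleep cycle
```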

How was this patch tested?

This PR is tested with Qwen/Qwen2.5-0.5B-Instruct.

At first, we have free NPU memory M1.

After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)` is executed, we have free NPU memory M2, with M2 < M1.

Then we call `llm.sleep(level=1)` and have free NPU memory M3.

We observe M3 > M2, and M3 is very close to M1.

In addition, the output tokens are identical before sleep and after wake-up, given the same input tokens and the config `SamplingParams(temperature=0, max_tokens=10)`.
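The memory check above can be sketched as follows. The invariant helper is plain Python; the probe `torch.npu.mem_get_info()` is an assumption, used by analogy with `torch.cuda.mem_get_info()` since torch_npu mirrors the CUDA device API, and the 5% tolerance for "M3 is very close to M1" is illustrative:

```python
def sleep_frees_memory(m1, m2, m3, rel_tol=0.05):
    """Invariant from the test: M2 < M1, M3 > M2, and M3 within rel_tol of M1."""
    return m2 < m1 and m3 > m2 and (m1 - m3) <= rel_tol * m1

def measure_on_npu():
    # Run on an Ascend NPU host; torch.npu.mem_get_info() is an assumed
    # torch_npu API returning (free_bytes, total_bytes).
    import torch
    from vllm import LLM

    def free():
        return torch.npu.mem_get_info()[0]

    m1 = free()
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    m2 = free()                       # weights + KV cache now resident
    llm.sleep(level=1)                # offload weights, discard KV cache
    m3 = free()
    assert sleep_frees_memory(m1, m2, m3)
```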

This PR utilizes the CMake procedure of #371, thanks a lot.
Related: vllm-project/vllm#16562

Signed-off-by: Shuqiao Li <[email protected]>
@celestialli celestialli changed the title [WIP] Add sleep mode feature for Ascend NPU Add sleep mode feature for Ascend NPU Apr 18, 2025
Collaborator

wangxiyuan commented Apr 18, 2025

LGTM. The sleep mode feature was mainly reviewed on the 0.7.3 branch; let's merge this quickly first.

@wangxiyuan wangxiyuan merged commit 84563fc into vllm-project:main Apr 18, 2025
15 checks passed
@celestialli celestialli deleted the sleepmode branch April 21, 2025 09:01
ttanzhiqiang pushed a commit to ttanzhiqiang/vllm-ascend that referenced this pull request Apr 27, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
