[core] Multi Step Scheduling #7000
Conversation
QQ: do you plan to split this PR into smaller pieces?
The first batch of comments.
@zhuohan123 @rkooo567 @Yard1 @comaniac @alexm-neuralmagic rebased and ready for review
Working on a smaller PR that contains parts of this.
First round of questions. Will add more tomorrow.
Second batch of reviews
Does this assert mean the MultiStepModelRunner can only be run with one step? Can you elaborate on this?
MultiStepModelRunner only takes a single step internally before returning to AsyncLLMEngine, as the multi-step is done implicitly using stateful model inputs and SequenceGroup states.
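To make that concrete, here is a minimal sketch of the pattern, assuming a simplified shape for the state. StatefulModelInput and cached_outputs are names used in this PR, but the fields and the run_single_step() call below are illustrative stand-ins rather than the actual vLLM code.

```python
# Illustrative sketch only: the fields and run_single_step() are simplified
# stand-ins for the real MultiStepModelRunner machinery, not the vLLM API.
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class StatefulModelInput:
    """Per-batch state carried across scheduler steps."""
    frozen_model_input: Any        # the usual single-step model input
    num_steps: int = 8             # total steps requested by the scheduler
    current_step: int = 0          # steps executed so far
    cached_outputs: List[Any] = field(default_factory=list)  # sampler outputs kept on GPU


def execute_one_step(runner: Any, state: StatefulModelInput) -> StatefulModelInput:
    # The runner advances exactly one step per call; "multi-step" emerges from
    # the engine invoking this num_steps times with the same state object.
    output = runner.run_single_step(state.frozen_model_input)  # hypothetical call
    state.cached_outputs.append(output)
    state.current_step += 1
    return state
```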
Thanks for the explanation!
This is a bit confusing though. IIRC, this was introduced by me for the multi-step draft model runner? We should remove this argument and use stateful model inputs as the unified representation. Also cc @alexm-neuralmagic
yeah let's remove this argument
Let's do this in a follow-up PR, as it will involve spec decode as well. Will add it to the TODO tracker.
tests/multi_step/test_correctness.py
@SolitaryThinker no need to block on this feedback, but if you have time, I would propose adding an example/offline_inference_multi_step.py example that instantiates an engine instance with multi-step enabled, similar in structure to example/offline_inference.py.
An example of why this is useful - as part of the logprobs workstream, I am trying to step through the multi-step model runner with the python debugger & examine the output logprobs. I am using your multi_step/test_correctness.py in order to set up a server with multi-step enabled.
However, multi_step/test_correctness.py is an end-to-end client/server test & it is not straightforward (although technically doable) to step through the server code with the debugger because the server is in another process.
I will get around this by writing a short script which sets up an engine instance with multi-step enabled.
However, for someone else who is approaching this code for the first time, it could be helpful to have an example file (or unit test) which just sets up an engine instance with multi-step enabled and invokes inference using LLM.generate(). This could be a good way to facilitate quick debugging & also gives insight into how the server works.
Here is the offline_inference_multi_step.py script I wrote for myself to facilitate debugging, if you would like to use it.
```python
'''
Example of setting up LLM with multi-step enabled.
In actuality, async engine would be a more sensible choice
for a real use-case. However this example is useful
for demonstration & debugging of multi-step code.
'''
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM with multi-step scheduling enabled.
llm = LLM(
    model="JackFram/llama-160m",
    swap_space=16,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    num_scheduler_steps=8,
    use_v2_block_manager=True,
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
I do wonder if there's a more generic way of doing this. If this data structure gets modified somewhere else, that change will not be reflected here. Maybe a loop where we check the device of each field if it is a tensor?
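A minimal sketch of that generic approach, assuming the container is a dataclass; the helper name below is illustrative and not an existing vLLM function.

```python
# Sketch of the "loop over fields and check tensors" idea; not existing vLLM code.
import dataclasses
from typing import Any

import torch


def all_tensors_on_device(obj: Any, device: torch.device) -> bool:
    """Return True if every torch.Tensor field of a dataclass instance is on `device`."""
    for f in dataclasses.fields(obj):
        value = getattr(obj, f.name)
        if isinstance(value, torch.Tensor) and value.device != device:
            return False
    return True
```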
These are Optionals and are only set if include_gpu_probs_tensor is set in the sampler.
@zhuohan123
LGTM! Thanks for the hard work! Please make sure to keep track of the TODOs we discussed in this PR.
@SolitaryThinker Huge thanks for the PR! QQ: I got the above error when running benchmark scripts with num_scheduler_steps > 1. Is this expected?
Hi @WoosukKwon. I see spec decode also has a class named MultiStepWorker; is there any relation with ...
Hello :) Sorry to bother you, but what is the state of this? Is it planned for implementation in the new v1 version? Thanks a lot!
My understanding is that this shouldn't be needed in v1, as this was a stopgap for the performance bottlenecks of the scheduler in v0. Someone more up-to-date can correct me though :)
Adds initial multi-step scheduling support to vLLM.
RFC: #6854
Current Status:
8/16: Initial support for chunked prefill, thanks to @varun-sundar-rabindranath
8/14: Ready for another round of reviews! Please review #7452
8/8: Multi-node working
8/6: PP+TP working; PP+ray fixed; a few single-GPU perf regressions (easy fix)
8/2: PP works with MP; ready for initial pass on design
8/1: PP is very close to working. We do get the desired interleaving of steps between microbatches, which is great!
7/31: Current branch is in very rough shape after getting the RFC design working. Will clean up after adding TP/PP support, as there may be some refactors needed. However, single GPU is ready for initial testing.
Cmd:
```bash
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B --swap-space 16 --disable-log-requests --use-v2-block-manager --tensor-parallel-size 1 --worker-use-ray --pipeline-parallel-size 1 --gpu-memory-utilization 0.90 --num-scheduler-steps 8
```
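For quick sanity checks against the server started above, a small client-side snippet can be used; it assumes the default host/port (localhost:8000) and the OpenAI-compatible /v1/completions route.

```python
# Query the multi-step-enabled server started with the command above.
# Assumes the default localhost:8000; adjust if the server runs elsewhere.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.8,
    },
)
print(resp.json()["choices"][0]["text"])
```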
Benchmark (8/16)
See: #7528
CP_1: Force Single Step: We force a single step when there are prefill requests in a batch. This may work well for offline batching, but it is not good for online serving because new requests keep arriving.
CP_2: Ignore Prefill (WIP): We ignore prefill requests from the second step onward, meaning that prefill requests do nothing for the remaining (k-1) steps. This may work better for online serving.
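To make the contrast concrete, here is a rough sketch of how the two policies pick the number of steps per scheduled batch; the batch/seq_group objects and the is_prefill() check are illustrative, not the actual scheduler code.

```python
# Illustrative contrast of CP_1 vs CP_2; not the actual vLLM scheduler logic.
def steps_for_batch(batch, num_scheduler_steps: int, policy: str) -> int:
    has_prefill = any(seq_group.is_prefill() for seq_group in batch)
    if policy == "CP_1":
        # Force a single step whenever the batch contains prefill requests.
        return 1 if has_prefill else num_scheduler_steps
    if policy == "CP_2":
        # Always run k steps; prefill requests only do work on the first step
        # and are ignored for the remaining (k - 1) steps.
        return num_scheduler_steps
    raise ValueError(f"unknown policy: {policy}")
```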
[Benchmark charts: H100 8B Llama, H100 70B Llama, A10G 8B Llama, A10G 8B Llama (microbatch=128)]

TODO:
Milestone 1: POC
- Add --max_forward_calls_per_step to CLI argument, engine args, and SchedulerConfig
- Add SequenceGroupState in sequence.py to track multi-step state
- Add MultiStepWorker in worker/ to cache multi-step state
- ModelRunner to handle multi-step state
- ModelRunner to reduce duplicate code
- (mem leak somewhere with RAY)

Milestone 2: Mergeable
- Rename --max_forward_calls_per_step to num_scheduler_steps
- Move multi-step logic out of model_runner.py, perhaps into multi_step_model_runner.py?

Follow-up work: Tracking Issue #7528
- _pythonize_sampler_output