[Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support #24845
Signed-off-by: Sage Moore <[email protected]>
…inson/attn-slicing
…pping asserts Signed-off-by: Sage Moore <[email protected]>
…sult in an empty second ubatch Signed-off-by: Sage Moore <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Thanks for the work! A few more thoughts
allow_microbatching_options = [True, False] if \
    capture_ubatched_graph else [False]
Could we simply use two bools?
force_attention: bool = False,
uniform_decode: bool = False,
allow_microbatching: bool = False,
allow_microbatching: bool = True,
In this case, would it be better to rename the param to something like `microbatching_fallback`? I feel `allow_microbatching` doesn't convey the idea you mention.
Alternatively, we could add detailed comments.
    pass

def max_sms_used(self) -> Optional[int]:
    return None  # None means it could use the whole GPU
Would -1 be better?
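For comparison, here is a rough sketch of how a caller might consume the `None` convention. It is not taken from the PR: `resolve_sm_budget` and `total_sms` are hypothetical names used only for illustration.

```python
from typing import Optional


def resolve_sm_budget(max_sms_used: Optional[int], total_sms: int) -> int:
    """Return how many SMs remain for other work (hypothetical helper)."""
    if max_sms_used is None:
        # None: the kernel may use the whole GPU, so nothing is left over.
        return 0
    # Otherwise subtract the kernel's share, never going below zero.
    return max(total_sms - max_sms_used, 0)
```

One trade-off to weigh: if a caller forgets the special case, `total_sms - None` raises a `TypeError` immediately, whereas `total_sms - (-1)` silently evaluates to `total_sms + 1`.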
parallel_group.add_argument(
    "--dbo-prefill-token-threshold",
    **parallel_kwargs["dbo_prefill_token_threshold"])
Please add comments for the new arg
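As a sketch of what a documented version of the flag could look like, the snippet below spells the options out by hand with plain `argparse` instead of going through `parallel_kwargs`. The help wording and the default value are assumptions made for illustration, not the text or value used in the PR.

```python
import argparse

parser = argparse.ArgumentParser()
parallel_group = parser.add_argument_group("parallel")
parallel_group.add_argument(
    "--dbo-prefill-token-threshold",
    type=int,
    default=512,  # placeholder default chosen for this sketch
    help="Assumed wording: minimum number of tokens in a batch that "
    "contains prefills before dual-batch overlap is used.",
)

# Parse an explicit argument list so the example does not read sys.argv.
args = parser.parse_args(["--dbo-prefill-token-threshold", "1024"])
```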
if hook is not None:
    if dbo_enabled():
        # If DBO is being used, register the hook with the ubatch
        # context and call it in dbo_maybe_run_recv_hook instead of
        # passing it to the receiver.
        dbo_register_recv_hook(hook)
        dbo_yield()
    else:
        hook()

receiver()
Why do we have this logic twice?
Factor it out into a function, since it appears twice?
        dbo_yield()
    else:
        hook()
Here once again
LGTM
vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py
if hook is not None:
    if dbo_enabled():
        # If DBO is being used, register the hook with the ubatch
        # context and call it in dbo_maybe_run_recv_hook instead of
        # passing it to the receiver.
        dbo_register_recv_hook(hook)
        dbo_yield()
    else:
        hook()

receiver()
Factor it out into a function, since it appears twice?
…e.py Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
This pull request has merge conflicts that must be resolved before it can be
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
@LucasWilkinson this commit introduces weird behaviour: the first HTTP request with a larger context works normally, but subsequent requests are significantly slower. I have verified that it is this commit: cc1dc7e, which should be this PR. a903669 works normally.
… and Prefill support (vllm-project#24845) Signed-off-by: Sage Moore <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: yewentao256 <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Sage Moore <[email protected]> Co-authored-by: yewentao256 <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
if not should_ubatch:
    num_pad, num_tokens_across_dp = self.get_dp_padding(num_tokens)
    num_tokens += num_pad
Removing this means the padding no longer happens.
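To make the concern concrete, here is a minimal sketch of the kind of padding step the quoted lines perform. It assumes that every data-parallel rank is padded up to the largest token count in the group. `dp_padding_sketch` is a hypothetical stand-in: unlike the `get_dp_padding` method in the diff, it takes the per-rank counts as an argument rather than gathering them.

```python
from typing import List, Tuple


def dp_padding_sketch(num_tokens: int,
                      tokens_per_rank: List[int]) -> Tuple[int, List[int]]:
    """Pad this rank up to the largest token count across DP ranks."""
    target = max(tokens_per_rank)
    num_pad = target - num_tokens
    # After padding, every rank is assumed to run with `target` tokens.
    return num_pad, [target] * len(tokens_per_rank)


num_tokens = 5
num_pad, num_tokens_across_dp = dp_padding_sketch(num_tokens, [5, 8, 3])
num_tokens += num_pad
```

Under that assumption, skipping the call leaves `num_tokens` at its unpadded value, so ranks would proceed with different token counts.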
Purpose
Test Plan
lm_eval
Test Result
export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
HT Overlap Trace (2x8xH100)
