-
-
Notifications
You must be signed in to change notification settings - Fork 10.6k
[Bugfix] Handle best_of>1
case by disabling speculation.
#6138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Thanks for the fix -- approach looks good to me. On whether or not we should support this -- for performance we would want to disable this feature or support it natively in spec decode. I am fine having this in, can we log once if this happens so there's a hint of the performance degredation to users? |
|
||
assert seqs, "expected running sequences" | ||
|
||
if len(seqs) > 1 or sequence_group.sampling_params.best_of > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment here explaining what's going on. also can we explicitly fail if beam search is enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
re: failing on beam search, I think this is best done before we put the requests in the batch so I've added some code in add_request
to that effect.
for seq_group_metadata in execute_model_req.seq_group_metadata_list: | ||
if seq_group_metadata.sampling_params.best_of > 1: | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest moving this to a method so you can add a docstring with more explanation on how this works (e.g. performance tradeoff)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Thomas Parnell <[email protected]>
I added a warning when we disable speculation due to n>1 or best_of>1. |
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you! |
This PR solves #6137 by disabling speculation for batches that contain any request with
best_of>1
.This approach ensures that we can handle requests with
best_of>1
without failure, but may have a downside that a single user sending requests withbest_of>1
can potentially ruin performance for other users withbest_of=1
.An alternative solution could just be to raise on those individual requests and give a message to the user like
best_of > 1 is not supported when speculative decoding is enabled
.I can also implement that if preferred. What do you think @cadedaniel @njhill ?