-
-
Couldn't load subscription status.
- Fork 10.9k
[PERF] Async structured outputs #23224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces an asynchronous architecture for guided decoding to improve performance, which is a significant and well-thought-out change. The use of multiprocessing and shared memory to offload heavy computations is a solid approach. The code is generally well-structured, and the separation of concerns into Manager, Gateway, and Executor classes is clean.
I've found a critical bug in the TPU model runner and a couple of opportunities to improve performance and robustness in the shared memory handling. My comments focus on these areas.
Overall, this is a great step towards optimizing structured output generation.
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
|
Hi, thanks for the PR! Actually, the bitmask construction can be naturally overlapped with the model execution if we restructure the loop a bit: #23233 (while the RFC doesn't consider the spec decode + structured outputs case yet). WDYT? |
1f7b15f to
d676e2c
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
985277e to
e6da326
Compare
|
One performance concern is how TTFT is improved in one case (the larger JSON schema you provided), but worse in others (using the xgrammar_bench dataset of simpler JSON schemas). It seems that, at least for simpler schemas, the old threadpool executor performed initialization faster. I'm hesitant to obsess over the results too much since I do believe this direction is good, and is required for us to move forward to adding support in the async scheduler. |
|
I'm happy with this going in as-is. It's a great next step and will enable us to more easily work on the next steps of integration with async scheduling. However, I would like to wait to merge until after we release v0.11.0. It feels like a very risky change to merge now when we plan to release any day. If we merge right after, that gives us more time to ensure there are no regressions. |
likely you should increase allowed num of files open. http require one file per request. |
I think it's just from sometimes cutting the response off early so the result ends up not being valid JSON. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with Ben offline yesterday, +1 for merging after 0.11.x
| class StructuredOutputTask: | ||
|
|
||
| def __init__(self, task_type: TaskType, args: tuple, kwargs: dict): | ||
| self.task_type = task_type | ||
| self.args = args | ||
| self.kwargs = kwargs | ||
|
|
||
|
|
||
| class StructuredOutputResult: | ||
|
|
||
| def __init__(self, | ||
| task_type: TaskType, | ||
| result: Any, | ||
| error: Optional[Exception] = None): | ||
| self.task_type = task_type | ||
| self.result = result | ||
| self.error = error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit stuff: to make these into dataclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR and review! I'd like to review this PR during weekend. Please hold it off until then.
- fix a bug with dp Signed-off-by: Vadim Gimpelson <[email protected]> #suppress-bc-linter
fix doc build Signed-off-by: Vadim Gimpelson <[email protected]> #suppress-bc-linter
Signed-off-by: Vadim Gimpelson <[email protected]>
|
@WoosukKwon bump on review! thank you |
Signed-off-by: Vadim Gimpelson <[email protected]>
Signed-off-by: Vadim Gimpelson <[email protected]>
|
Now v0.11.0 has been released, ready to go? |
|
@simon-mo |
waiting on @WoosukKwon 's approval |
|
This pull request has merge conflicts that must be resolved before it can be |
|
@vadiklyutiy if you've not seen it already, please check out https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1759663228844749 for instructions on resolving the merge conflicts |
|
@hmellor thank you for pointed this post. But the last 1.5 month I resolved merge conflicts in this PR 7-8 times. It is not always simple and I spend time on it. I want to hear that we go with this PR and after that will resolve everything. Hope for your understanding |
|
@WoosukKwon could you kindly clarify are you going to review this PR or not? |
|
No problem @vadiklyutiy let's wait for Woosuk's response. I just wanted to share the instructions for if you do need them. |
Yes, seems without this instruction merge would be a pian. Thanks for sharing |
|
Had a meeting. |
Purpose
Continue working on optimizing structured outputs started in #21862.
This PR implements asynchronous calculation of structured outputs to improve performance of structured output generation by moving heavy computation to a separate background process.
Key Changes:
StructuredOutputManager(Main process): Coordinates operations via multiprocessing queuesStructuredOutputGateway(Child process): Background process that receives and executes tasksStructuredOutputExecutor(Child process): Performs actual grammar compilation, bitmask generation, and token acceptanceGrammarBitmaskPlaceholderand can continue executionto reduce amount of communications
Test Plan
test3.json
Test Result
NVidia H200 GPU,
This PR
Main
Essential Elements of an Effective PR Description Checklist